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Abstract 

Present  day  microbial  genomes  are  the  handiwork  of  over  3  billion  years  of  evolution. 
Comparisons  between  these  genomes  enable  stepping  backwards  through  past  evolu¬ 
tionary  events,  and  can  be  formalized  using  binary  tree  models  known  as  phytogenies. 
In  this  thesis,  I  present  three  new  phylogenetic  methods  for  gaining  insight  into  how 
microbes  evolve.  In  Chapter  1,  I  introduce  the  algorithm  AdaptML,  which  uses  strain 
ecology  information  to  identify  genetically-  and  ecologically-distinct  bacterial  popu¬ 
lations.  Analysis  of  1000  marine  Vibrionaceae  strains  by  AdaptML  finds  evidence 
that  niche  adaptation  may  influence  patterns  of  genetic  differentiation  in  bacteria. 
In  Chapter  2,  I  introduce  the  algorithm  AnGST,  which  can  infer  the  evolutionary 
history  of  a  gene  family  in  a  chronological  context.  Analysis  of  3968  gene  families 
drawn  from  100  modern  day  organisms  with  AnGST  reveals  genomic  evidence  for 
a  massive  expansion  in  microbial  genetic  diversity  during  the  Archean  eon  and  the 
gradual  oxygenation  of  the  biosphere  over  the  past  3  billion  years.  Lastly,  I  intro¬ 
duce  in  Chapter  3  the  algorithm  GAnG,  which  can  construct  prokaryotic  species  trees 
from  thousands  of  distinct  gene  trees.  GAnG  analysis  of  archaeal  gene  trees  supports 
hypotheses  that  the  Nanoarchaeota  diverged  from  the  last  ancestor  of  the  Archaea 
prior  to  the  Crenarchaeota/Euryarchaeota  split. 
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Overview 


Present  day  microbial  genomes  are  the  handiwork  of  over  3  billion  years  of  evolution. 
Comparisons  between  these  genomes  enable  stepping  backwards  through  evolutionary 
history  -  genetic  features  present  across  a  wide  diversity  of  genomes  likely  arose 
more  anciently  than  features  found  in  subsets  of  related  genomes.  This  intuition  is 
formalized  using  binary  tree  models  of  sequence  evolution  known  as  phylogenetic  trees. 
Phytogenies  propose  a  series  of  ancestral  sequence  divergence  events  that  explain  the 
similarity  of  extant  sequences.  These  trees  can  in  turn  be  used  to  build  models  for 
the  evolution  of  organismal  phenotypes,  such  as  preferred  environment  or  lifestyle. 
However,  prokaryotes’  capacity  for  horizontal  gene  transfer  (HGT)  can  require  regions 
of  the  same  genome  to  be  associated  with  different  phylogenetic  trees,  and  ultimately 
obscure  which  phylogenetic  tree  best  represents  overall  genome  evolution. 

In  this  thesis,  I  present  three  novel  phylogenetic  approaches  for  inferring  micro¬ 
bial  evolutionary  history  through  the  comparison  of  gene  sequences.  The  remainder 
of  Part  I  briefly  describes  the  research  context  in  which  I  developed:  AdaptML,  an 
algorithm  for  detecting  signatures  of  ecological  adaptation  influencing  bacterial  ge¬ 
netic  differentiation;  and  AnGST,  an  algorithm  for  inferring  the  series  of  HGT,  gene 
duplication,  and  gene  loss  events  that  gave  rise  to  a  gene  family.  I  go  on  in  Chapter  1 
of  Part  II  to  use  AdaptML  to  identify  genetically-  and  ecologically-distinct  clusters  of 
Vibrionaceae  coexisting  in  a  marine  environment.  In  Chapter  2,  I  use  AnGST  to  infer 
patterns  in  microbial  genome  evolution  over  the  past  3.8  billion  years.  I  use  AnGST 
again  in  Chapter  3  in  the  development  of  GAnG,  a  new  method  for  constructing 
prokaryotic  species  trees  from  thousands  of  gene  trees.  I  conclude  this  thesis  in  Part 
III  with  a  summary  of  the  chapters  and  a  brief  discussion  of  ongoing  and  future  work 
with  AdaptML,  AnGST,  and  GAnG. 
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Detecting  relationships  between  genetic  and  ecolog¬ 
ical  differentiation  in  bacteria 

Distinct  groups  of  closely-related  bacteria,  or  phylogenetic  clusters,  are  a  recurring 
pattern  of  genetic  differentiation  among  bacterial  isolate  housekeeping  genes  [1-3]. 
Ecological  adaptation  is  suspected  of  playing  a  role  in  cluster  formation  [4],  Ac¬ 
cording  to  the  ecotype  model,  genetically-distinct  bacterial  populations  form  when 
bacterial  populations  adapt  to  an  ecological  niche  and  are  repeatedly  purged  of  ge¬ 
netic  variation  through  periodic  selection  events  [5].  However,  a  theoretical  study 
has  shown  that  genetically  distinct  sub-populations  can  form  under  a  neutral  model 
that  either  prohibits  recombination,  or  simulates  high  within-cluster  recombination 
[6].  Alternatively,  a  recent  phylogenetic  analysis  of  eight  sequenced  Vibrio  isolates 
has  found  evidence  for  a  combined  model  featuring  both  ecological  adaptation  and 
neutral  processes  contributing  to  genetic  differentiation.  Under  this  model,  the  intro¬ 
duction  of  niche-adaptive  alleles  initially  erodes  sympatry  in  a  bacterial  population. 
Reduced  gene  flow  between  niche-adapted  bacteria  and  the  remaining  population 
subsequently  yields  genetically-distinct  subgroups  [7].  Ultimately,  if  niche  adapta¬ 
tion  drives  the  formation  of  genetically-distinct  bacterial  groups,  members  of  each 
group  should  inhabit  a  common  niche.  Mathematical  models  capable  of  identifying 
both  genetically-  and  ecologically-cohesive  bacterial  groups  can  thus  be  used  to  help 
resolve  the  role  of  ecological  adaptation  in  the  genetic  differentiation  of  bacteria. 

Several  existing  statistical  methods,  such  as  the  Fst  test,  the  P  test,  and  Unifrac, 
can  evaluate  the  null  hypothesis  that  phylogenetic  clusters  do  not  exhibit  distinct 
ecological  associations.  These  tests  assume  that  bacterial  sequences  are  annotated 
with  ecological  metadata  describing  the  environment  each  sequence  was  harvested 
from.  The  Fst  test  compares  the  genetic  diversity  among  bacteria  annotated  as 
sharing  the  same  environment  to  the  genetic  diversity  measured  across  all  sampled 
sequences.  Low  genetic  diversity  within  a  particular  environment,  coupled  with  high 
genetic  diversity  between  environments,  is  evidence  for  rejection  of  the  null  hypothesis 
of  no  association  between  genetic  clustering  and  bacterial  ecology  [8].  Alternatively, 
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the  P  test  builds  a  phytogeny  of  strain  sequences,  labels  leaves  by  their  environmental 
association,  and  uses  a  parsimony  model  to  infer  the  number  of  times  ancestral  strains 
on  the  tree  changed  environmental  associations.  Low  parsimony  scores  are  evidence 
for  rejecting  the  null  hypothesis  [8].  Lastly,  the  Unifrac  model  combines  elements  of 
both  the  Fst  and  P  test,  utilizing  genetic  distances  and  strain  tree  topology  to  test 
the  relationship  between  strain  genetic  clustering  and  associated  environment  [9]. 

One  weakness,  however,  of  the  Fst,  P,  and  Unifrac  statistics  is  their  potential  for 
erroneously  reporting  no  association  between  genetic  clustering  and  ecology  when  the 
ecological  forces  driving  cluster  formation  are  unmeasured  or  improperly  annotated. 
For  example,  consider  a  bacterial  sequence  cluster  caused  by  adaptation  to  conditions 
between  20°-30°C.  An  association  between  this  cluster  and  ecology  would  go  unrec¬ 
ognized  by  Fst,  P,  or  Unifrac  analyses  if  temperature  data  was  not  collected,  or  if 
temperature  data  were  discretized  into  only  two  ranges:  <  25°C  and  >25°C.  Thus, 
these  statistics  may  not  be  appropriate  for  analyzing  the  evolution  of  bacteria  whose 
niche  composition  is  unknown  or  highly  uncertain,  as  environmental  parameters  de¬ 
scribing  these  bacteria’s  niche  may  not  have  been  measured. 

Another  inference  algorithm,  Ecotype  Simulation  (ES),  can  identify  genetically- 
and  ecologically-distinct  clusters  in  a  manner  insensitive  to  how  ecological  parameters 
are  measured  [10].  ES  finds  ecotypes  by  fitting  a  maximum  likelihood  model  onto  a 
gene  phylogeny.  This  model  estimates  the  rates  of  ecotype  formation,  periodic  se¬ 
lection,  and  genetic  drift,  as  well  as  the  total  number  of  ecotypes  present.  Identified 
ecotypes  can  subsequently  be  analyzed  using  ecological  measurements  and  multivari¬ 
ate  statistics  in  order  to  confirm  that  niche-adaptation  has  taken  place  and  identify 
environmental  parameters  that  define  the  niche.  Recent  application  of  this  approach 
discovered  ecotypes  among  Bacillus  strains  sampled  from  Death  Valley,  CA,  which 
could  be  distinguished  by  adaptation  to  solar  exposure  and  soil  texture  [11].  One 
drawback  to  the  ES  algorithm,  however,  is  that  it  cannot  detect  nascent  ecotype 
formation  events  that  have  not  yet  undergone  multiple  series  of  periodic  selections. 

In  Chapter  1,  I  present  a  new  method  named  AdaptML,  which  uses  a  maxi¬ 
mum  likelihood  model  to  identify  genetically-  and  ecologically-coherent  clusters  of 


bacterial  strains.  This  model  explicitly  combines  genetic  information  embedded  in 
sequence-based  phytogenies  with  environmental  sampling  data.  Recent  niche  adap¬ 
tation  events,  characterized  by  ecologically  coherent  clusters  with  minimal  genetic 
distinction  from  a  parent  clade,  can  be  captured  by  the  model.  Although  AdaptML 
cannot  detect  ecological  associations  with  unmeasured  environmental  parameters,  the 
algorithm  can  account  for  environmental  parameter  discretization  schemes  that  would 
generally  confound  previous  methods  for  detecting  ecological  associations.  To  do  this, 
I  introduce  the  model  concept  of  a  “habitat.”  Habitats  are  characterized  by  discrete 
probability  distributions  describing  the  likelihood  that  a  strain  adapted  to  a  habitat 
will  be  sampled  from  a  given  ecological  state  (e.g.  at  a  particular  location  in  an  estu¬ 
ary).  Habitats  are  not  defined  a  priori  but  rather  learned  directly  from  the  sequence 
and  ecological  data  using  an  Expectation  Maximization  routine.  Once  habitats  are 
defined,  I  learn  a  maximum  likelihood  model  for  the  evolution  of  habitat  association 
on  the  tree.  Randomization  experiments  can  be  used  to  determine  which  sequence 
clusters  show  a  statistically-significant  association  with  a  given  habitat. 
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Inferring  the  evolutionary  history  of  microbial  gene 
families 

Microbial  genomes  do  not  evolve  solely  by  point  mutation  [12,  13].  Comparison  of 
gamma-proteobacterial  genomes  suggests  gene  loss  events  eliminated  thousands  of 
genes  from  the  ancestor  of  the  Buchnera  following  its  adoption  of  an  endosymbiotic 
lifestyle  [14].  Genes  can  be  gained,  via  either  the  duplication  of  small  regions  of  the 
genome  [15],  or  via  the  duplication  of  the  entire  genome  itself,  as  has  been  shown  for 
yeast  [16] .  Gene  gain  is  also  possible  via  HGT  and  is  a  well-known  source  of  genomic 
diversity  among  the  prokaryotes  [12].  Cases  of  HGT  have  also  been  identified  between 
eukaryotes  [17,  18]  and  even  from  bacteria  to  animals  [19,  20].  Models  that  can  infer 
when  genes  have  undergone  loss,  duplication,  or  HGT,  and  when  genes  have  been 
vertically  inherited,  are  necessary  for  understanding  the  relative  contribution  of  these 
four  mechanisms  to  genome  evolution. 

Algorithms  for  inferring  the  evolutionary  history  of  gene  families  vary  according  to 
their  reliance  on  phylogenetic  models  and  how  they  account  for  gene  gain  events  (Ta¬ 
ble  1).  Phylogeny-free  methods  utilize  features  such  as  GC-bias  to  detect  xenologous 
genes  [21,  22],  or  within-genome  BLAST  searches  to  find  evidence  for  past  duplication 
events  [23] .  More  complex  approaches,  known  as  presence-absence  models,  construct 
a  phytogeny  of  sampled  species  and  identify  which  leaves  on  the  tree  are  represented 
in  a  gene  family  of  interest.  Parsimony  algorithms  can  then  be  used  to  identify  a  set  of 
ancestral  gene  duplication,  gene  loss,  or  HGT  events  to  explain  the  observed  pattern 
of  gene  presence  and  absence  on  the  species  tree  [24-27].  However,  presence/absence 
algorithms  may  underestimate  the  amount  of  HGT  in  a  gene  family  history,  since 
frequent  HGT  events  can  produce  presence/absence  patterns  similar  to  those  caused 
by  gene  birth  at  a  deep  node,  followed  by  vertical  descent.  More  sensitive  models 
capable  of  differentiating  between  these  scenarios  utilize  gene  sequence  information, 
in  addition  to  a  species  tree.  Quartet  methods  quantify  how  strongly  quartets  of 
orthologous  genes  support  each  of  the  three  possible  4-taxon  trees  representing  their 
evolutionary  history  [28,  29].  Quartets  that  strongly  support  topologies  discordant 
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Model 

Species 

tree 

Gene 

tree 

Unc. 

gene 

trees 

Finds 

HGT 

Finds 

dup. 

Refs. 

GC-bias 

No 

No 

- 

Yes 

No 

[21,  22] 

BLAST-hits 

No 

No 

- 

Yes 

Yes 

[23] 

Presence/absence 

Yes 

No 

- 

Yes 

Yes 

[24-27] 

Quartet  mapping 

Partial 

Partial 

Yes 

Yes 

No 

[28,  29] 

Parsimony  recon¬ 
ciliation 

Yes 

Yes 

No 

Yes 

Yes 

[30,  31] 

Probabilistic  rec¬ 
onciliation 

Yes 

Yes 

Yes 

Yes 

No 

[33,  34] 

Table  1:  Selection  of  existing  models  used  to  infer  gene  family  evolutionary 
histories:  Models  are  characterized  by  their  explicit  usage  of  species  trees  and  gene  trees, 
their  consideration  of  gene  tree  uncertainty,  and  their  ability  to  detect  HGT  and  dupli¬ 
cation  events.  Note  that  only  References  [27,  31]  can  find  HGT  and  duplication  events 
simultaneously. 

with  the  expected  species  tree  are  evidence  for  HGT  within  the  gene  family.  More 
elaborate  “reconciliation”  models  compare  full  gene  and  species  trees  in  order  to  in¬ 
fer  a  precise  phylogenetic  location  for  each  inferred  evolutionary  event.  Parsimony 
reconciliation  models  [30,  31],  however,  will  infer  spurious  events  if  phylogenetic  con¬ 
struction  errors  are  present  in  the  gene  tree  [32],  Newer  probabilistic  reconciliation 
algorithms  have  been  developed  to  deal  with  these  potential  inaccuracies  [33,  34], 
Gene  family  evolutionary  history  models  can  also  be  partitioned  according  to 
whether  they  account  for  gene  gain  using  duplication  or  HGT  events.  With  the 
exception  of  Snel  and  Charleston’s  algorithms  [27,  31],  evolutionary  history  models 
usually  account  for  only  one  of  these  two  events.  The  specificity  of  these  models  may 
be  caused  by  self-reinforcing  biases  associated  with  the  expected  modes  of  eukaryotic 
and  prokaryotic  genome  evolution.  The  relative  rarity  of  reported  HGT  events  among 
eukaryotes,  compared  to  duplication  events,  likely  encourages  analyses  of  eukaryotic 
genome  evolution  using  tools  specialized  only  to  detect  gene  duplications.  By  contrast, 
recognition  of  how  HGT  can  accelerate  prokaryotic  adaptation  and  blur  species  lines 
has  probably  reduced  interest  in  broad  surveys  of  potential  prokaryotic  duplication. 
Exceptions  to  this  proposed  bias  among  prokaryotic  studies  do  exist,  however,  as 
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Gevers  et  al.  cataloged  gene  duplications  in  106  bacterial  genomes  [23]  and  Snel 
searched  for  both  HGT  and  duplication  among  17  archaeal  and  bacterial  genomes 
[27].  Increasing  examples  of  HGT  among  eukaryotes  [17,  18]  are  also  fueling  new 
interest  in  systematically  searching  for  HGT  across  the  eukaryotes  [35] .  Bias  against 
the  creation  of  models  that  account  for  both  HGT  and  duplication  is  also  likely  due 
to  issues  of  model  complexity.  In  certain  scenarios,  gene  duplication  and  gene  loss 
can  produce  gene  tree  topologies  similar  to  those  yielded  by  HGT  [36].  A  combined 
HGT/duplication  inference  model  must  be  capable  of  recognizing  this  scenario  and 
proposing  plausible  HGT  and  duplication  scenarios.  Moreover,  a  combined  model 
requires  defining  a  metric  to  choose  which  of  these  scenarios  is  preferable. 

In  Chapter  2,  I  present  a  new  reconciliation  method  for  inferring  a  set  of  gene  loss, 
gene  duplication,  and  HGT  events  that  explain  topological  incongruities  between  a 
species  tree  and  a  gene  tree.  I  named  this  algorithm  the  Analyzer  of  Gene  &  Species 
Trees,  or  AnGST.  AnGST  was  inspired  by  a  gene  family  evolution  model  originally 
designed  for  problems  in  biogeography  and  the  inference  of  gene  duplication  and  gene 
loss  events  [30].  Also  referred  to  as  a  host-parasite  model,  this  approach  seeks  to  infer 
which  ancestral  genome  (the  host)  on  the  reference  tree  possessed  each  ancestral  gene 
copy  (the  parasite).  AnGST  employs  a  generalized  parsimony  framework  in  order  to 
choose  when  duplication  events  should  be  inferred  instead  of  HGT  scenarios.  This 
framework  assigns  scores  to  each  type  of  evolution  event  and  returns  the  evolutionary 
history  with  the  lowest  overall  score.  AnGST  can  further  minimize  reconciliation 
scores  by  reconciling  multiple  gene  tree  bootstraps  simultaneously  and  combining 
their  lowest  scoring  subtrees  into  a  single  chimeric  gene  tree.  This  bootstrap  amalga¬ 
mation  step  reduces  the  opportunity  for  poorly  resolved  gene  tree  subtrees  to  cause 
the  spurious  inference  of  evolutionary  events. 
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Chapter  1 


Resource  partitioning  and 
sympatric  differentiation  among 
closely  related  bacterioplankton 


Identifying  ecologically  differentiated  populations  within  complex  micro¬ 
bial  communities  remains  challenging,  yet  is  critical  for  interpreting  the 
evolution  and  ecology  of  microbes  in  the  wild.  Here  we  describe  spatial 
and  temporal  resource  partitioning  among  Vibrionaceae  strains  coexist¬ 
ing  in  coastal  bacterioplankton.  A  quantitative  model  (AdaptML)  estab¬ 
lishes  the  evolutionary  history  of  ecological  differentiation,  thus  revealing 
populations  specific  for  seasons  and  life-styles  (combinations  of  free-living, 
particle,  or  zooplankton  associations).  These  ecological  population  bound¬ 
aries  frequently  occur  at  deep  phylogenetic  levels  (consistent  with  named 
species);  however,  recent  and  perhaps  ongoing  adaptive  radiation  is  evi¬ 
dent  in  Vibrio  splendidus ,  which  comprises  numerous  ecologically  distinct 
populations  at  different  levels  of  phylogenetic  differentiation.  Thus,  envi¬ 
ronmental  specialization  may  be  an  important  correlate  or  even  trigger  of 
speciation  among  sympatric  microbes. 
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Microbes  dominate  biomass  and  control  biogeochemical  cycling  in  the  ocean,  but 
we  know  little  about  the  mechanisms  and  dynamics  of  their  functional  differentiation 
in  the  environment.  Culture-independent  analysis  typically  reveals  vast  microbial  di¬ 
versity,  and  although  some  taxa  and  gene  families  are  differentially  distributed  among 
environments  [37,  38] ,  it  is  not  clear  to  what  extent  coexisting  genotypic  diversity  can 
be  divided  into  functionally  cohesive  populations  [37,  39].  First,  we  lack  broad  surveys 
of  nonpathogenic  free-living  bacteria  that  establish  robust  associations  of  individual 
strains  with  spatiotemporal  conditions  [40,  41];  second,  it  remains  controversial  what 
level  of  genetic  diversification  reflects  ecological  differentiation.  Phylogenetic  clus¬ 
ters  have  been  proposed  to  correspond  to  ecological  populations  that  arise  by  neutral 
diversification  after  niche-specific  selective  sweeps  [5].  Clusters  are  indeed  observed 
among  closely  related  isolates  (e.g.,  when  examined  by  multilocus  sequence  analysis) 
[4]  and  in  culture-independent  analyses  of  coastal  bacterioplankton  [42],  Yet  recent 
theoretical  studies  suggest  that  clusters  can  result  from  neutral  evolution  alone  [6], 
and  evidence  for  clusters  as  ecologically  distinct  populations  remains  sparse,  having 
been  most  conclusively  demonstrated  for  cyanobacteria  along  ocean-scale  gradients 
[43]  and  in  a  depth  profile  of  a  microbial  mat  [44]).  Further,  horizontal  gene  transfer 
(HGT)  may  erode  the  ecological  cohesion  of  clusters  if  adaptive  genes  are  transferred 
[45],  and  recombination  can  homogenize  genes  between  ecologically  distinct  popu¬ 
lations  [46].  Thus,  exploring  the  relationship  between  phylogenetic  and  ecological 
differentiation  is  a  critical  step  toward  understanding  the  evolutionary  mechanisms 
of  bacterial  speciation  [6]. 

In  this  study,  we  investigated  ecological  differentiation  by  spatial  and  temporal 
resource  partitioning  in  coastal  waters  among  coexisting  bacteria  of  the  family  Vibri- 
onaceae,  which  are  ubiquitous,  metabolically  versatile  heterotrophs  [47].  The  coastal 
ocean  is  well  suited  to  test  population-level  effects  of  microhabitat  preferences,  be¬ 
cause  tidal  mixing  and  oceanic  circulation  ensure  a  high  probability  of  migration,  re¬ 
ducing  biogeographic  effects  on  population  structure.  In  the  plankton,  heterotrophs 
may  adopt  alternate  ecological  strategies:  exploiting  either  the  generally  lower  concen¬ 
tration  but  more  evenly  distributed  dissolved  nutrients  or  attaching  to  and  degrading 
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small  suspended  organic  particles,  originating  from  algal  exopolysaccharides  and  de¬ 
tritus  [39].  Bacterial  microhabitat  preferences  may  develop  because  resources  are 
distributed  on  the  same  scale  as  the  dispersal  range  of  individuals,  due  to  turbulent 
mixing  and  active  motility  [48].  Of  potential  microhabitats,  particles  represent  abun¬ 
dant  but  relatively  short-lived  resources,  as  labile  components  are  rapidly  utilized  (on 
time  scales  of  hours  to  days)  [49,  50],  implying  that  particle  colonization  is  a  dynamic 
process.  Moreover,  particulate  matter  may  change  composition  with  macroecologi- 
cal  conditions  (such  as  seasonal  algal  blooms).  Zooplankton  provide  additional,  more 
stable  microhabitats;  vibrios  attach  to  and  metabolize  chitinous  zooplankton  exoskelc- 
tons  [51,  52]  but  may  also  live  in  the  gut  or  occupy  niches  specific  to  pathogens.  The 
extent  to  which  microenvironmental  preferences  contribute  to  resource  partitioning  in 
this  complex  ecological  landscape  remains  an  important  question  in  microbial  ecology 
[53], 

We  aimed  to  conservatively  identify  ecologically  coherent  groups  by  examining 
distribution  patterns  of  Vibrionaceae  genotypes  among  free-  living  and  associated 
(with  suspended  particles  and  zooplankton)  compartments  of  the  planktonic  environ¬ 
ment  under  different  macroecological  conditions  (spring  and  fall)  (Figs.  1.3  &  1.5). 
Because  the  level  of  genetic  differentiation  at  which  ecological  preferences  develop  is 
not  known,  we  focused  on  a  range  of  relationships  (0  to  10%  small  subunit  riboso- 
mal  RNA  (rRNA)  divergence)  among  co-occurring  vibrios  [54],  Particle-associated 
and  free-living  cells  were  separated  into  four  consecutive  size  fractions  by  sequential 
filtration  (four  replicate  water  samples,  each  subsampled  with  at  least  four  replicate 
filters  per  size  fraction);  each  fraction  contained  organisms  and  dead  organic  material 
of  different  origins  (detailed  in  the  supporting  online  material  [SOM];  Section  1.2). 
For  simplicity,  we  refer  to  these  fractions  as  enriched  in  zooplankton  (>63  mm),  in 
large  (5  to  63  mm)  and  small  (1  to  5  mm)  particles,  and  in  free-living  cells  (0.22  to  1 
mm)  (Fig.  1.5B).  The  1-  to  5-mm  size  fraction  was  somewhat  ambiguous,  probably 
containing  small  particles  as  well  as  large  or  dividing  cells;  however,  it  provided  a  firm 
buffer  between  obviously  particle-associated  (>5  mm)  and  free-living  (<1  mm)  cells. 
Vibrionaceae  strains  were  isolated  by  plating  filters  on  selective  media,  previously 
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shown  by  quantitative  polymerase  chain  reaction  to  yield  good  correspondence  be¬ 
tween  genotypes  recovered  in  culture  and  those  present  in  environmental  samples  [54] . 
Roughly  1000  isolates  were  characterized  by  partial  sequencing  of  a  protein-coding 
gene  ( hsp60 ).  To  obtain  added  resolution,  between  one  and  three  additional  gene 
fragments  (■ mdh ,  adk,  and  pgi)  were  sequenced  for  over  half  of  the  isolates  (SOM), 
including  V.  splendidus  strains,  the  most  abundant  group  [54], 

Our  rationale  for  testing  environmental  associations  grows  out  of  the  following 
considerations.  First,  as  in  most  ecological  sampling,  the  true  habitats  or  niches 
are  unknown  and  can  only  be  observed  as  projections  onto  the  sampling  dimensions 
(“projected  habitats”).  Thus,  associations  can  be  detected  as  distinct  distributions 
of  groups  of  strains  if  habitats/niches  are  differentially  apportioned  among  samples. 
Second,  the  lack  of  an  accepted  microbial  species  concept  implies  that  it  is  imprudent 
to  use  any  measure  of  genetic  relationships  to  define  a  priori  the  populations  whose 
environmental  association  should  be  assessed.  Therefore,  we  first  tested  the  null 
hypothesis  that  there  is  no  environmental  association  across  the  phylogeny  of  the 
strains.  We  then  refined  such  estimates  by  developing  a  new  model  to  simultaneously 
identify  populations  and  their  projected  habitats.  Finally,  these  model-based  results 
were  tested  with  nonparametric  empirical  statistics. 

The  initial  null  hypothesis  of  no  association  between  phylogeny  and  ecology  is 
strongly  rejected  (seasons:  p  <  10”79;  size  fractions:  p  <  10”49)  by  comparing  the 
parsimony  score  of  observed  environments  on  the  tree  to  that  expected  by  chance  [55] 
(SOM),  confirming  the  visual  impression  of  differential  patterns  of  clustering  among 
seasons  and  size  fractions  (Fig.  1.1A).  This  result  is  robust  toward  uncertainty  in  the 
phylogeny,  which  should  diminish  but  not  strengthen  associations,  and  is  confirmed 
by  introducing  additional  uncertainty  in  the  phylogeny  (Fig.  1.6).  The  observed 
overall  association  with  season  and  size  fraction  therefore  suggests  that  water-column 
vibrios  partition  resources,  but  neither  provides  insights  into  the  phylogenetic  bounds 
of  populations  or  the  composition  of  their  habitats. 

We  therefore  developed  an  evolutionary  model  (AdaptML)  to  identify  popula¬ 
tions  as  groups  of  related  strains  sharing  a  common  projected  habitat,  which  reflects 
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their  relative  abundance  in  the  measured  environmental  categories  (size  fractions  and 
seasons)  (SOM).  In  practice,  the  model  inputs  are  the  phylogeny,  season,  and  size 
fraction  of  the  strains.  It  then  maps  changes  in  environmental  preference  onto  the 
tree  by  predicting  projected  habitats  for  each  extant  and  ancestral  strain  in  the  phy¬ 
logeny.  Although  similar  in  spirit  to  existing  parsimony,  likelihood,  and  Bayesian 
methods,  which  map  ancestral  states  onto  trees  [56],  the  model  accounts  for  the  com¬ 
plexities  and  uncertainties  of  environmental  sampling.  First,  projected  habitats  can 
span  multiple  sampling  dimensions  to  account  for  complex  life  cycles  (such  as  time 
spent  in  multiple  true  habitats)  and  problems  inherent  in  environmental  sampling: 
Discrete  samples  rarely  equate  to  true  habitats,  and  true  habitats  are  frequently  mis¬ 
placed  among  their  typical  sample  categories  (for  example,  zooplankton  fragments 
may  also  be  found  in  smaller  size  fractions).  Second,  projected  habitats  can  span 
multiple  phylogenetic  clusters  to  allow  for  the  possibility  that  clusters  may  arise  neu¬ 
trally  or  that  the  relevant  parameters  differentiating  them  ecologically  have  not  been 
measured. 

Briefly,  AdaptML  builds  a  hidden  Markov  model  for  the  evolution  of  habitat 
associations:  Adjacent  nodes  on  the  phylogeny  transition  between  habitats  according 
to  a  probability  function  that  is  dependent  on  branch  length  and  a  transition  rate, 
which  is  learned  from  the  data  (SOM)  (Fig.  1.7).  Subsequently,  we  optimize  the 
model  parameters  (the  transition  rate  and  the  composition  of  each  projected  habitat) 
to  maximize  the  likelihood  of  the  observed  data.  Finally,  we  use  a  simple  ad  hoc  rule 
for  reducing  noninformative  parameters:  We  merge  habitats  that  converge  to  similar 
distributions  (simple  correlation  of  distribution  vectors  >90%)  during  the  model¬ 
fitting  procedure  (SOM).  This  reproducibly  identified  six  nonredundant  habitats  for 
the  observed  data  set  (H^  to  in  Figs.  1.1B  and  1.9).  Moreover,  the  algorithm  acts 
conservatively,  as  suggested  by  two  tests.  First,  the  model  did  not  overfit  the  data 
when  there  was  no  ecological  signal  present:  When  the  environments  were  shuffled, 
only  a  single  generalist  habitat  (evenly  distributed  over  all  size  fractions  and  seasons) 
was  recovered.  Second,  when  simulated  habitats  were  used  to  generate  environmental 
assignments,  the  model  usually  identified  a  number  of  habitats  equal  to  or  less  than 
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the  true  number  present  (Fig.  1.10). 

The  analysis  suggests  that  a  single  bacterial  family  coexisting  in  the  water  column 
resolves  into  a  striking  number  of  ecologically  distinct  populations  with  clearly  iden¬ 
tifiable  preferences  (habitats).  The  algorithm  identified  25  populations,  associated 
with  one  of  the  six  habitats  defined  by  distinct  distributions  of  isolates  over  seasons 
and  size  fractions  (Fig.  1.1  and  Fig.  1.11).  Most  clusters  have  a  strong  seasonal 
signal;  interestingly,  two  pairs  of  highly  similar  habitats  are  observed  in  both  seasons 
(Fig.  1.1B).  The  first  of  the  habitat  pairs  corresponds  to  populations  occurring  both 
free-living  and  on  particles  but  lacking  zooplankton-associated  isolates  (Hs  and  H^); 
the  second  indicates  a  preference  for  zooplankton  and  large  particles  (H^  and  H^) 
(Fig.  1.1B).  The  remaining  two  habitats  were  season-specific.  Habitat  combines 
all  primarily  free-living  populations  in  the  fall,  whereas  habitat  identifies  a  second 
particle-  and  zooplankton-associated  group  in  spring,  but  unlike  and  it  has  a 
higher  proportion  of  large  particles  and  maps  onto  a  single  small  group  (G25)  (Fig. 
1.1).  However,  we  cannot  place  high  confidence  in  the  absence  of  the  free-living  habi¬ 
tat  in  the  spring,  because  relatively  few  strains  were  recovered  from  that  fraction. 
Moreover,  the  distribution  of  individual  populations  among  seasons  and  size  frac¬ 
tions  varies  considerably,  with  remarkably  narrow  preferences  for  some  populations 
whereas  others  are  more  broadly  distributed.  For  example,  V.  ordalii  (G3)  is  almost 
exclusively  free-living  in  both  seasons,  whereas  V.  alginolyticus  (G5)  has  a  significant 
representation  in  both  zooplankton  and  free-living  size  fractions  but  occurs  exclu¬ 
sively  in  the  fall  (Fig.  1.1,  A  and  B).  The  sequences  of  three  additional  genes  for  V. 
alginolyticus  isolates  were  identical,  arguing  against  misidentihcation  due  to  recombi¬ 
nation  or  additional  population  substructuring.  Similarly,  there  was  good  agreement 
when  two  different  gene  phytogenies  ( hsp60  and  mdh )  were  used  to  identify  habitats 
for  V.  splendidus  (Fig.  1.12),  although  fewer  habitats  were  identified  using  the  mdh 
tree,  most  likely  because  it  is  less  well- resolved.  Overall,  across  all  vibrios  sampled, 
association  with  the  zooplankton-enriched  and  free-living  fractions  dominated,  and 
although  several  populations  contain  particle-associated  isolates,  only  a  few  appear 
to  be  specifically  particle-adapted.  Because  vibrios  are  generally  regarded  as  particle 
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and  zooplankton  specialists  [47],  this  observed  partitioning  offers  new  insight  into 
their  ecology. 

Thus,  in  spite  of  the  highly  variable  conditions  of  the  water  column,  popula¬ 
tions  appear  to  finely  partition  resources,  especially  because  our  habitat  estimates 
are  conservative,  as  clusters  occupying  the  same  habitat  may  be  differentiated  along 
additional  (unobserved)  resource  axes.  For  example,  different  zooplankton-associated 
groups  may  be  host-  or  body  region-specific,  and  the  strong  seasonal  signal  of  most 
clusters  may  be  due  to  a  variety  of  factors;  however,  temperature  is  a  likely  candi¬ 
date  because  it  has  so  far  arisen  as  the  strongest  correlate  of  microbial  population 
changes  both  over  a  seasonal  cycle  [57]  and  along  ocean-scale  gradients  [43].  Fi¬ 
nally,  populations,  which  appear  unassociated  in  our  study,  may  be  true  generalists 
with  respect  to  the  resource  space  sampled  or  may  be  adapted  to  environments  not 
sampled  in  this  study,  such  as  animal  intestines  or  sediments  [47].  Despite  these  un¬ 
certainties,  the  observed  strong  partitioning  among  associated  and  free- living  clusters 
may  have  important  implications  for  population  biology  in  the  bacterioplankton.  As 
recently  suggested  [6],  for  attached  bacteria,  the  effective  population  size  (Ne)  may 
be  considerably  smaller  than  the  census  size  because  colonization  serves  as  a  pop¬ 
ulation  bottleneck,  whereas  in  free-living  clusters,  Ne  may  be  closer  to  the  census 
size.  Although  computing  the  true  magnitude  of  Ne  in  microbial  populations  remains 
controversial  [58] ,  it  is  an  important  parameter  that  determines  the  relative  strength 
of  selection  and  drift.  Thus,  attached  and  free-living  populations  may  evolve  under 
different  constraints  [6]. 

The  phylogenetic  structure  of  populations  also  provides  insights  into  the  history 
of  habitat  switches.  Deeply  branching  populations  may  have  remained  associated 
with  habitats  over  long  evolutionary  time,  and  shallow  branches  may  have  diversified 
more  recently  (Fig.  1.1,  A  and  B).  These  stable  habitat-associated  clusters  roughly 
correlate  to  named  species  within  the  Vibrionaceae.  For  example,  V.  ordalii  (G3)  and 
Enterovibrio  norvegicus  (G2)  both  represent  clusters  without  close  relatives  contain¬ 
ing  >  50  isolates,  which  are  overwhelmingly  predicted  to  follow  primarily  free-living 
(hU)  and  free-living/particle-associated  lifestyles  (He),  respectively  (Fig.  1.1A).  On 
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the  other  hand,  some  very  closely  related  clusters  are  associated  with  different  habi¬ 
tats;  V.  splendidus,  which  is  composed  of  strains  that  are  ~99%  identical  in  rDNA 
gene  sequence  [54],  differentiates  into  15  microdiverse  habitat-associated  clusters,  of 
which  one  is  distributed  roughly  evenly  among  both  seasons,  and  9  and  5  predom¬ 
inantly  occur  in  spring  and  fall,  respectively.  Thus,  V.  splendidus  appears  to  have 
ecologically  diversified,  possibly  by  invading  new  niches  or  partitioning  resources  at 
increasingly  fine  scales. 

Recent  or  perhaps  ongoing  radiation  by  sympatric  resource  partitioning  is  most 
strongly  suggested  for  two  nested  clusters  within  V.  splendidus ,  where  groups  of 
strains  differing  by  as  little  as  a  single  nucleotide  in  hsp60  display  distinct  ecological 
preferences  (Fig.  1.1  A,  insets,  and  Fig.  1.3).  These  strains  were  isolated  from  multiple 
independent  samples  and  thus  do  not  represent  clonal  expansion,  suggesting  that  this 
may  reflect  a  true  habitat  switch;  nonetheless,  homologous  recombination  could  also 
move  alleles  between  distantly  related,  ecologically  distinct  clusters,  creating  spurious 
phylogenetic  relationships,  which  can  be  detected  by  comparison  with  other  genes. 
Multilocus  sequence  analysis  shows  that  for  nested  cluster  I,  a  close  relationship 
was  artificially  created  because  hsp60  gene  phylogeny  is  discordant  with  three  other 
genes  (Fig.  1.2).  However,  this  still  represents  a  habitat  switch,  just  at  a  slightly 
larger  sequence  distance,  as  I.A  is  nested  within  the  much  larger  G16  cluster  in 
both  the  hsp60  and  the  mdh-pgi-adk  phytogenies.  For  the  second  nested  cluster, 
the  three  additional  genes  confirm  partial  separation  of  the  subclusters  II. A  and  II. B 
by  a  single  base  pair  difference  in  one  of  the  genes,  whereas  the  other  genes  consist 
of  identical  alleles.  This  reinforces  the  idea  that  subcluster  II. A  is  not  incorrectly 
grouped  because  of  recombination,  despite  its  distinct  ecological  affiliation  (Fig.  1.2). 
In  combination,  these  data  support  the  idea  that  there  is  ecological  differentiation 
among  recently  diverged  genotypes  and  show  that  such  changes  might  be  recognized 
in  protein-coding  genes  as  soon  as  they  accumulate  (neutral)  sequence  changes. 

How  might  adaptation  to  a  new  habitat  relate  to  speciation,  the  generation  of  dis¬ 
tinct  clusters  of  closely  related  bacteria?  Mathematical  modeling  has  recently  shown 
that  the  dynamics  of  speciation  depend  on  the  ratio  of  homologous  recombination  to 
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mutation  rates  (r/m)  [6].  When  this  ratio  per  allele  exceeds  ~1,  populations  tran¬ 
sition  from  essentially  clonal  to  sexual,  with  the  major  consequence  that  selection  is 
probably  required  for  the  formation  of  clusters  [6].  Our  preliminary  multilocus  se¬ 
quence  analysis  on  a  set  of  strains  with  similar  taxonomic  composition  suggests  that 
their  r/m  is  well  above  that  threshold.  Thus,  our  observations  of  habitat  separation 
for  highly  similar  but  clearly  distinct  genotypes  suggest  that  ecological  selection  may 
have  triggered  phylogenetic  differentiation.  A  plausible  mechanism  is  that  differen¬ 
tial  distribution  among  habitats  (possibly  caused  by  few  adaptive  loci)  is  sufficient 
to  depress  gene  flow  between  associated  genotypes  [6,  59].  Consequently,  mutations 
will  no  longer  be  homogenized  but  instead  accumulate  within  specialized  populations, 
even  for  ecologically  neutral  genes.  Over  time,  genetic  isolation  may  increase  because 
homologous  recombination  rates  decrease  log-linearly  with  sequence  distance  [60] .  We 
detected  associations  with  different  habitats  among  sister  clades  over  a  wide  range 
of  phylogenetic  distances,  possibly  representing  populations  at  various  stages  of  dif¬ 
ferentiation  (Fig.  1.1A).  Although  we  cannot  determine  whether  clusters  represent 
transiently  adapted  populations  or  nascent  species,  our  observations  of  differential 
distributions  of  genotypes  suggest  that  there  exists  a  small-scale  adaptive  landscape 
in  the  water  column  allowing  the  initiation  of  (sympatric)  speciation  within  this  com¬ 
munity. 

Although  it  has  recently  been  suggested  that  microbial  lineages  remain  specific 
to  macroenvironments  over  long  evolutionary  times  [61],  this  study  demonstrates 
switches  in  ecological  associations  within  a  bacterial  family  coexisting  in  the  coastal 
ocean.  In  the  V.  splendidus  clade,  speciation  could  be  ongoing,  but  the  divergence 
between  most  other  ecologically  defined  groups  appears  large.  This  is  consistent  with 
our  previous  suggestion  that  rRNA  gene  clusters,  which  are  roughly  congruent  with 
the  deeply  divergent  protein-coding  gene  clusters  detected  here,  represent  ecological 
populations  [42] .  However,  the  example  of  V.  splendidus  highlights  the  fact  that  using 
marker  genes  to  assess  community-wide  diversity  may  not  capture  some  ecological 
specialization.  Moreover,  different  groups  of  organisms  could  evolve  under  different 
constraints,  and  the  mechanisms  suggested  here  apply  to  the  invasion  of  new  habitats 
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and  are  thus  different  from  (but  compatible  with)  the  widely  discussed  niche-specific 
selective  sweeps  [10].  Why  V.  splendidus  appears  to  have  radiated  recently  into  new 
habitats  whereas  other  groups  appear  to  be  more  constant  is  not  known  but  may 
be  related  to  its  high  heterogeneity  in  genome  architecture  [54],  This  could  indicate 
a  large  (flexible)  gene  pool  that,  if  shared  by  horizontal  gene  transfer,  gives  rise  to 
large  numbers  of  ecologically  adaptive  phenotypes.  It  will  therefore  be  important 
to  compare  whole  genomes  within  recently  ecologically  diverged  clusters  to  identify 
specific  changes  leading  to  adaptive  evolution. 
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1.1 


Figures 


Figure  1.1:  Season  and  size  fraction  distributions  and  habitat  predictions  mapped  onto 
Vibrionaceae  isolate  phytogeny  inferred  by  maximum  likelihood  analysis  of  partial  hsp60 
gene  sequences.  Projected  habitats  are  identified  by  colored  circles  at  the  parent  nodes. 
(A)  Phylogenetic  tree  of  all  strains,  with  outer  and  inner  rings  indicating  seasons  and 
size  fractions  of  strain  origin,  respectively.  Ecological  populations  predicted  by  the  model 
are  indicated  by  alternating  blue  and  gray  shading  of  clusters  if  they  pass  an  empirical 
confidence  threshold  of  99.99%  (see  SOM  for  details).  Bootstrap  confidence  levels  are  shown 
in  Fig.  1.14.  (B)  Ultrametric  tree  summarizing  habitat-associated  populations  identified 
by  the  model  and  the  distribution  of  each  population  among  seasons  and  size  fractions.  The 
habitat  legend  matches  the  colored  circles  in  (A)  and  (B)  with  the  habitat  distribution  over 
seasons  and  size  fractions  inferred  by  the  model.  Distributions  are  normalized  by  the  total 
number  of  counts  in  each  environmental  category  to  reduce  the  effects  of  uneven  sampling. 
The  insets  at  the  lower  right  of  (A)  show  two  nested  clusters  (I.A  and  I.B  and  II. A  and  II. B) 
for  which  recent  ecological  differentiation  is  inferred,  including  habitat  predictions  at  each 
node.  The  closest  named  species  to  numbered  groups  are  as  follows:  Gl,  V.  calviensis ;  G2, 
Enterovibrio  norvegicus ;  G3,  V.  ordain;  G4,  V.  rumoiensis ;  G5,  V.  alginolyticus ;  G6,  V. 
aestuarianus ;  G7,  V.  fischeri/logei ;  G8,  V.  fischeri ;  G9,  V.  superstes ;  G10,  V.  penaeicida; 
Gil  to  G25,  V.  splendidus. 
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Figure  1.2:  Multilocus  sequence  analysis  of  nested  clusters  (IA  and  IB  and  IIA  and  IIB) 
with  differential  habitat  association  by  comparison  of  partial  hsp60  (left)  and  concatenated 
partial  mdh,  adk,  and  pgi  (right)  gene  phytogenies.  Habitat  predictions  (indicated  by  colored 
boxes)  and  the  numbering  of  clusters  correspond  to  Fig.  1.1.  Scale  bar  is  in  units  of 
nucleotide  substitutions  per  site. 
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1.2  Supplementary  Material 

1.2.1  Sampling  rationale 

To  investigate  partitioning  of  Vibrionaceae  strains  in  the  water  column,  we  exam¬ 
ined  their  distribution  among  the  free-living  and  associated  (with  particles  and  zoo¬ 
plankton)  fractions  of  the  bacterioplankton  community  at  two  time  points.  This  was 
achieved  by  sequential  filtration  with  decreasing  pore  size  cutoffs  and  subsequent  cul¬ 
tivation  on  Vibrio  selective  media  (Fig.  1.5B).  Here,  we  give  additional  details  on 
sampling  protocols  and  rationale  supplementing  the  overview  given  in  the  main  text. 

Filtration  is  commonly  used  in  oceanography  to  separate  particle-associated  and 
free-living  populations  by  retention  of  particles  on  filters,  although  the  filter  size  cut  off 
for  collecting  particle-attached  bacteria  has  varied  in  past  studies  between  0.8  and  10 
/ini  [62-65] .  To  obtain  higher  ecological  resolution,  we  used  sequential  filtration  since 
alternate  types  of  particulate  organic  matter  and  organisms  (e.g.,  phytoplankton, 
zooplankton)  will  have  distribution  maxima  in  different  size  fractions  thus  enabling 
differentiation  of  associated  bacterial  genotypes. 

We  collected  a  total  of  four  size  fractions  with  different  expected  composition 
of  particles  and  organisms  (Fig.  1.5B).  The  largest  fraction  (>63  m)  was  visually 
enriched  in  zooplankton  and  detrital  material  (e.g.,  pieces  of  macroalgae,  terrestrial 
plant  material);  however,  large  gelatinous  material  [frequently  part  of  marine  snow, 
which  represents  particles  >0.5  mm  [66]]  was  likely  not  collected  since  it  is  disrupted 
by  the  pressure  on  the  plankton  nets  used  for  collection.  All  other  fractions  were 
collected  by  gravity  rather  than  vacuum  filtration  to  minimize  disruption  of  frag¬ 
ile  particles.  The  large  particle  fraction  (63-5  /mi)  likely  contains  zooplankton  fecal 
pellets,  dead  and  living  algae,  and  other  detritus.  The  composition  of  the  5-1  /irn 
size  fraction  is  somewhat  ambiguous  since  it  may  contain  both  cells  attached  to  very 
small  particles  as  well  as  large  or  dividing  cells;  however,  it  provides  a  firm  buffer 
between  obviously  particle-attached  (>5  /mi)  and  free-living  (<l/im)  cells.  Particu¬ 
late  material  in  this  size  range  may  include  small  algae,  bacterial  cell  walls,  as  well  as 
fragments  of  larger  particles,  which  have  broken  apart;  nonetheless,  the  small  size  of 
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such  particles  in  unlikely  to  sustain  a  resident  bacterial  population.  Free-living  bac¬ 
teria,  observed  in  the  1-0.22  /jrn  size  fraction,  likely  live  on  dissolved  organic  matter 
produced  by  living  algae,  cell  lysis  and  the  dissolution  of  particles. 

1.2.2  Sample  collection 

Coastal  ocean  water  samples  were  collected  at  high  tide  on  the  marine  end  of  the 
Plum  Island  Estuary  (NE  Massachusetts)  (Fig.  1.5 A)  on  two  days  representing  spring 
(4/28/06)  and  fall  (9/6/06)  conditions  in  the  coastal  ocean.  Nutrient  concentrations, 
water  temperature  and  chlorophyll  levels  were  measured  on  both  sampling  dates  (Fig. 
1.3). 

Two  replicate  samples  of  the  largest  size  fraction  (enriched  in  zooplankton)  were 
collected  by  filtering  ~100  L  each  through  a  63  /jrn  plankton  net,  which  was  sub¬ 
sequently  washed  with  sterile  seawater  (Fig.  1.5B).  Particle-associated  and  free- 
living  bacterial  populations  were  collected  from  quadruplicate  water  samples,  which 
were  independently  2  pre-hltered  through  the  63  /irn  plankton  net  (to  remove  the 
zooplankton-enriched  fraction)  into  4  L  nalgene  bottles  (Fig.  1.5B).  For  each  bottle, 
water  was  sequentially  filtered  through  5,  1  and  0.22  /irn  pore  size  filters,  collecting 
at  least  four  replicate  filters  per  size  fraction.  To  avoid  disruption  of  fragile  parti¬ 
cles,  the  63-5  and  5-1  //nr  fractions  were  collected  on  polycarbonate  membrane  filters 
(Sterlitech)  using  gravity  filtration  followed  by  washing  with  5  ml  of  sterile  (0.22 
/rm- filtered  and  Tindalized)  seawater  to  remove  free-living  bacteria  that  might  have 
been  retained  on  the  filter.  The  <1  /jrn  fraction  containing  free-living  bacteria  was 
collected  on  0.22  /jrn  Supor-200  filters  (Pall)  by  applying  gentle  vacuum  pressure. 

After  size  fractionation,  particles  and  zooplankton  were  broken  up  before  plating 
(Fig.  1.5B).  The  zooplankton  sample  was  homogenized  using  a  tissue  grinder  (VWR 
Scientific)  and  vortexed  for  20  minutes  at  low  speed  before  concentration  on  0.22  /jrn 
Supor-200  filters  (Pall).  These  filters  were  then  plated  directly  on  selective  media. 
Similarly,  5  /jrn  and  1  /jrn  filters  were  placed  in  50  ml  conical  tubes  with  50  ml  sterile 
seawater  and  vortexed  at  low  speed  for  20  min  to  break  up  particles  and  detach 
bacteria  from  the  filters.  The  supernatant  was  concentrated  on  0.22  /jrn  filters,  and 


both  the  original  and  supernatant  filters  were  placed  directly  on  media  to  collect 
isolates. 

1.2.3  Strain  isolation  and  identification 

Isolates  were  obtained  from  TCBS  plates  (Accumedia  or  Difo)  with  2%  NaCl  since 
this  media  has  been  shown  to  yield  good  correspondence  in  phylogenetic  groups  of 
vibrios  detected  by  quantitative  PCR  and  isolation  [54],  After  2-3  days  of  growth, 
colonies  were  counted  and  re-streaked  a  total  of  three  times  alternately  on  Tryptic 
Soy  Broth  (TSB)  (Difco)  with  2%  NaCl  and  media. 

For  classification  of  strains  by  sequencing,  purified  isolates  were  grown  in  marine 
TSB  broth  overnight;  DNA  was  extracted  using  either  a  tissue  DNA  kit  (Qiagen)  or 
Lyse-  N-Go  (Pierce).  Following  the  rationale  of  multilocus  sequence  analysis  (MLSA), 
housekeeping  genes  were  used  for  further  strain  characterization  since  these  are  un¬ 
likely  to  be  under  environmental  selection.  The  partial  hsp60  gene  sequence  was 
amplified  for  all  isolates  as  described  previously  [67].  For  isolates  with  an  hsp60 
sequence  differing  by  more  than  2%  from  an  already  characterized  strain,  the  16S 
rRNA  gene  was  PCR  amplified  using  primers  27F-  1492R  and  sequenced  using  the 
27F  primer  [68].  The  16S  sequence  was  used  to  identify  the  organism  using  the 
RDP  classifier  [69]  and  BLAST  [70].  In  cases  where  the  hsp60  gene  either  failed 
to  amplify  or  the  sequence  diverged  greatly  from  other  vibrios,  16S  rRNA  gene  se¬ 
quencing  confirmed  that  these  isolates  largely  belonged  to  the  genera  Pseudomonas, 
Shewanella ,  Pseudoalteromonas ,  and  Agaravorans  (RDP  Classifier)  [69].  We  excluded 
non-  Vibrionaceae  strains  from  further  analysis. 

To  confirm  relationships  for  V.  splendidus,  the  most  highly  represented  group 
among  isolates,  an  additional  gene  ( mdh )  was  sequenced.  The  partial  mdh  gene  was 
amplified  using  primers  mdh. mod. for  (5’-  GAY  CTD  AGY  CAY  ATC  CCW  AC  -3’) 
and  mdh. mod.rev  (5’-  GCT  TCW  ACM  ACY  TCD  GTR  CCY  G  -3’)  (S.  Preheim, 
unpublished  data).  Two  additional  housekeeping  gene  sequences  were  obtained  (pgi, 
adk)  for  select  groups  of  strains,  using  pgi.kn'  (5  GAC  CTW  GGY  CCW  TAG  ATG 
GT  3  -  3)  and  pgi. rev  (5-CMG  CRC  CRT  GGA  AGT  TGT  TRT-3)  (S.  Preheim, 
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unpublished  data)  and  ad/;;. for  (5-  GTA  TTC  CAC  AAA  TYT  CTA  CTG  G-3)  and 
adk. rev  (5-  GCT  TCT  TTA  CCG  TAG  TA-  3)  [71].  All  of  these  genes  were  amplified 
using  the  following  PGR  conditions:  2  min  at  94°C  followed  by  32  cycles  of  1  min  each 
at  94°C,  46°  and  72°C  with  a  final  step  of  6  min  at  72°C.  Most  genes  were  sequenced 
at  least  twice  using  forward  and  reverse  primers.  All  sequencing  was  performed  at 
the  Bay  Paul  Center  at  the  Marine  Biological  Laboratory,  Woods  Hole  MA. 

1.2.4  Phylogenetic  tree  construction  and  representation 

The  partial  hsp60,  mdh,  adk,  and  pgi  gene  sequences  yielded  unambiguous  align¬ 
ments  of  541,  422,  372,  395  nucleotides,  respectively.  Phylogenetic  relationships  were 
reconstructed  using  PhyML  v.2.4.4  [72]  with  following  parameter  settings:  DNA  sub¬ 
stitution  was  modeled  using  the  HKY  parameter  [73];  the  transition/transversion 
ratio  was  set  to  4.0;  PhyML  estimated  the  proportion  of  invariable  nucleotide  sites; 
the  gamma  distribution  parameter  was  set  to  1.0;  4  gamma  rate  categories  were  used; 
a  B10NJ  tree  was  initially  used;  and,  both  tree  topology  and  branch  lengths  were 
optimized  by  PhyML.  Circular  tree  figures  were  drawn  using  the  online  iTOL  software 
package  [74],  To  prevent  numerical  instabilities  in  AdaptMLs  maximum  likelihood 
computations,  branches  with  zero  length  were  assigned  the  minimal  observed  non-zero 
branch  length:  0.001. 

1.2.5  Empirical  statistical  testing 

We  employed  empirical  statistics  to  quantify  evidence  for  differential  environmental 
distribution  of  phylogenetic  groups  (Fig.  1.6).  We  first  tested  the  overall  association 
of  phylogeny  with  our  environmental  data  using  a  non-parametric  parsimony-based 
metric.  We  assigned  a  different  character  to  each  of  the  environmental  categories, 
and  calculated  the  minimum  number  of  character  transitions  needed  to  explain  the 
data  given  the  observed  hsp60  phylogeny.  Although  this  test  is  likely  to  be  overly 
conservative  given  the  heterogeneous  nature  of  our  observed  clusters,  it  nonetheless 
supported  a  highly  significant  correlation  between  phylogeny  and  both  size  fraction  (p 
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<  10-49)  and  season  (p  <  1CT '9).  Exact  p- values  were  computed  based  on  the  algo¬ 
rithm  of  [55].  It  was  not  possible  to  compute  p- values  for  both  season  and  size  fraction 
together  because  the  computational  complexity  of  the  algorithm  grows  exponentially 
with  the  number  of  character  states. 

We  also  employed  non-parametric  empirical  statistics  to  test  specific  model  pre¬ 
dictions.  We  tested  the  hypothesis  that  each  of  the  clusters  identified  by  the  model 
would  be  likely  to  arise  by  chance.  To  do  this,  we  produced  a  2  x  8  contingency  table 
to  test  for  any  associations  between  cluster  membership  and  distribution  across  envi¬ 
ronments.  We  used  the  Fisher  exact  test  [75]  as  implemented  in  the  R  programming 
language  to  evaluate  the  significance  of  each  association.  The  results  are  shown  in 
Figure  1.4. 

1.2.6  Overview  of  Adapt  ML 

We  developed  a  maximum  likelihood  method  to  help  identify  the  boundaries  of  eco¬ 
logically  distinct  populations  and  infer  the  ancestral  habitat  association  of  internal 
nodes  in  the  strain  phytogeny.  The  key  to  our  method  is  a  hidden  variable  mapping  a 
’projected  habitat’  to  each  node.  We  mathematically  characterize  each  habitat  as  a 
discrete  probability  distribution,  which  describes  the  likelihood  that  a  strain  adapted 
to  that  habitat  will  be  observed  in  each  of  our  eight  environmental  categories.  These 
distributions,  which  we  refer  to  as  emission  probabilities  in  accordance  with  terminol¬ 
ogy  used  in  machine  learning,  are  not  known  a  priori  and  must  be  learned  from  the 
data.  Because  of  this  probabilistic  definition  of  habitats,  a  phylogenetic  group  span¬ 
ning  several  environmental  categories  can  still  be  considered  an  ecologically  distinct 
population. 

Using  the  habitat  variables  and  isolate  sequence-based  phytogeny,  we  built  a 
hidden-Markov  model  (HMM)  [75]  describing  the  evolution  of  habitat  association 
(Figure  1.7).  The  probability  that  adjacent  nodes  in  the  phylogeny  share  the  same 
habitat  is  a  function  of  both  the  branch  length  separating  them  and  a  parameter  that 
represents  the  rate  at  which  a  lineage  can  transition  between  habitats  (the  transition 
rate).  The  observed  variables  the  environmental  category  from  which  each  strain 
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was  sampled  occur  only  at  the  leaves  of  the  phylogeny.  The  parameters  necessary 
for  our  model  can  be  learned  from  the  data  according  to  the  following  algorithm: 

1.  Initialize  parameters:  We  initialize  16  habitats,  each  with  random  emission 
probability  distributions  over  the  8  environmental  categories.  The  transition 
rate  parameter  is  initialized  to  ICE1  transitions  per  substitution/site  (relative 
to  the  gene  phylogeny  branch  lengths). 

2.  Infer  the  observed  datas  likelihood,  given  phylogeny  and  parameter 
estimates:  We  use  a  dynamic  programming  algorithm  and  the  model  param¬ 
eters  (transition  rate  and  emission  probabilities)  to  compute  the  likelihood  of 
the  observed  data  (environmental  category  for  each  isolate).  Our  computation 
proceeds  in  a  manner  identical  to  Felsenstein’s  “pruning”  method  of  computing 
likelihoods  on  a  phylogenetic  tree  [76]. 

3.  Optimize  parameters  to  maximize  likelihood  of  observed  data:  We  es¬ 
timate  the  probability  that  each  internal  node  is  associated  with  a  given  habitat 
by  summing  over  all  possible  habitat  assignments  at  other  nodes  (E-step).  These 
probabilities  are  used  (M-step)  to  update  the: 

(a)  Transition  rate  parameter :  We  numerically  optimize  the  transition  rate  to 
maximize  the  likelihood  of  the  observed  data. 

(b)  Emission  probability  matrix :  We  update  the  emission  probability  matrix 
by  taking  the  matrix  that  maximizes  the  likelihood  of  the  observed  data 
given  the  marginal  likelihoods  for  the  habitat  assignments  at  each  of  the 
phylogenys  leaves. 

We  note  that  separating  these  two  steps  represents  an  approximation,  as  these 
two  parameters  are  not  strictly  independent.  The  approximation,  however, 
speeds  up  the  implementation  considerably  as  only  one  parameter  (instead  of  1 
+  16  x  7  =  113)  is  optimized  numerically. 

4.  Test  for  convergence:  If  the  model  parameters  do  not  change  significantly 
from  the  previous  iteration,  then  the  emission  probabilities  and  the  transition 
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rate  are  considered  to  have  converged:  continue  to  step  (5).  (A  typical  tra¬ 
jectory  of  emission  probability  convergence  is  shown  in  Figure  1.8.  Note:  the 
approximation  identified  in  step  3  can  lead  to  fluctuation  near  a  likelihood  max¬ 
imum  rather  than  actual  convergence).  Otherwise,  return  to  step  (2). 

5.  Test  for  model  complexity /redundancy.  If  a  pair  of  habitats  has  emission 
probability  distributions  that  exhibit  correlations  greater  than  0.90,  they  are 
merged  into  a  single  habitat  and  the  algorithm  continues  from  step  (2).  If 
no  habitats  are  merged,  the  parameter  estimation  loop  terminates.  Although 
our  approach  employed  manual  inspection  and  empirical  testing  rather  than 
a  likelihood-based  criterion  for  reducing  model  complexity  [such  as  the  AIC 
[77]],  our  algorithm  can  be  easily  extended  to  include  a  likelihood  criterion. 
To  test  for  overfitting,  we  performed  simulations  as  described  in  the  main  text 
and  Figure  1.10.  We  found  that  our  scheme  acts  conservatively  since  it  usually 
underestimated  the  true  number  of  habitats.  Figure  1.13  shows  how  the  inferred 
habitats  identified  by  the  model  vary  with  different  cutoffs. 

Once  a  set  of  model  parameters  has  been  learned,  we  utilize  the  following  protocol 
to  identify  ecologically  distinct  and  statistically  significant  populations. 

1.  Infer  node  habitat  assignments  that  maximize  the  joint  probability 
of  the  observed  data:  We  rely  upon  a  joint  likelihood  calculation  to  infer  a 
single  habitat  assignment  per  ancestral  node.  To  compute  this  likelihood,  we 
use  the  parameter  estimates  inferred  by  the  algorithm  described  above,  which 
sums  over  all  habitat  assignments.  Phylogenetic  groups  that  share  a  common 
habitat  association  are  taken  as  candidate  ecological  populations  if  they  pass 
an  empirical  significance  test. 

2.  Empirical  testing  to  identify  ecologically  distinct  populations:  The 

assignment  of  nodes  to  habitats  in  step  (1)  identifies  the  most  likely  set  of 
population  boundaries,  but  may  include  some  weakly  predicted  clusters.  To 
filter  low  confidence  ecological  groupings,  we  estimate  empirical  p-values  for 


33 


each  clade  and  only  report  statistically  significant  (p  <  0.0001)  populations 
(Figure  1A  &  IB).  Empirical  p- values  are  computed  by  comparing  the  likelihood 
of  the  parent  node  for  a  cluster  to  the  likelihood  observed  at  the  same  node  in 
randomized  trials  where  environmental  assignments  are  shuffled,  but  phylogeny 
is  maintained.  For  comparison,  all  possible  clusters  (with  no  significance  cutoff) 
can  be  inferred  from  the  full  model  results  shown  in  Figure  1.11. 


1.2.7  Detailed  description  of  maximum  likelihood  model 

Conditional  likelihoods 


The  conditional  likelihood  describes  the  likelihood  L  that  the  leaves  of  a  subtree 
exhibit  their  observed  states,  conditional  on  the  subtree’s  root  node  k  taking  state  s. 
This  likelihood  can  be  defined  recursively,  assuming  two  child  nodes  l  and  m 


Lk(s)  = 


^^P(x|s,t;)Lw(x)j  x 


P(y\s,tm)Lkm(y) 


(1.1) 


where  the  function  P(x\s,t)  represents  the  probability  of  transitioning  between 
states  x  and  s  along  some  interval  t.  To  reduce  the  number  of  fitted  parameters 
early  in  our  model,  we  use  the  simplifying  assumption  that  all  state  transitions  can 
be  described  using  the  same  transition  rate  parameter  y.  Thus,  we  compute  the 
probability  of  transitions  between  states  as 


and  the  probability  of  remaining  in  the  same  state 


(1.2) 


P(x\x,t)  =  i  (1  +  (ft  -  1)  x  e-hl“)  (1.3) 

which  is  analogous  to  the  Jukes-Cantor  model  for  nucleotide  sequence  evolution 
[78].  Note  that  if  k  is  a  leaf  node  in  state  s  with  observed  environment  o,  the  likelihood 
is  drawn  from  the  emission  probability  matrix  Pe: 
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Lk(s)  —  Pe(o\s) 


(1.4) 


Marginal  and  joint  likelihoods 

To  compute  the  marginal  likelihood  that  a  node  k  has  state  s,  we  combine  the  con¬ 
ditional  likelihoods: 


MLk(s)  =  -x[] 


Ep<  x\ s,  tki)Li{x ) 


(1.5) 


where  the  l  are  drawn  from  the  set  of  nodes  adjacent  to  k,  and  K  is  a  normalization 
factor  such  that 


YJMLk{s)  =  l  (1.6) 

S 

To  compute  the  likelihood  of  the  observed  data  for  the  single  best  assignment  of 
habitats  to  nodes  (IT),  the  summation  terms  in  the  likelihood  formula  are  replaced 
with  max  operations.  This  is  equivalent  to  maximizing  the  “joint”  likelihood: 

Lk(s)  =  (max P(x\s,tki)Li(x)Sj  x  ^maxP(y\s,tkrn)Lm(y)^  (1.7) 

The  backtracking  process  used  to  keep  track  of  the  max  arguments  is  analogous 

to  the  Viterbi  algorithm  for  finding  the  most  probable  state  path  in  an  HMM  (16). 

Parameter  estimation 

As  probability  distributions,  each  set  of  emission  probabilities  must  satisfy 

Ep«(°w  =  i  (i-s) 

O 

Because  leaf  nodes  are  independent  samples  of  their  emission  probability  distri¬ 
butions  (given  their  state  assignment),  we  use  a  weighted-average  approach  to  calcu¬ 
lating  the  emission  probability  matrix  Pe 
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(1.9) 


/•  io|Sj  =  -(  x  y  MLk(s)  x  A (o,ok) 

k 

where  MLk(s)  is  the  marginal  likelihood  that  node  k  is  adapted  to  habitat  s,  the 
function  A  (  x,y )  equals  1  if  and  only  if  x  equals  y  (and  is  0  otherwise),  and  K  is  a 
normalization  factor. 

We  learn  the  transition  rate  parameter  y  by  numerical  optimization  to  maximize 
the  likelihood  of  the  observed  data  (summed  over  all  values  for  the  hidden  variables). 

1.2.8  Defining  ecologically  coherent  and  significant  clusters 

We  use  empirical  testing  to  estimate  the  statistical  significance  associated  with  the 
likelihood  value  computed  for  each  node  in  the  maximum  (joint)  likelihood  assignment 
of  habitats  to  nodes  (given  the  final  parameter  estimates).  Each  trial  preserved 
the  phylogeny,  the  inferred  habitat  assignments,  and  the  habitat  and  transition  rate 
parameters,  but  environmental  categories  at  the  leaves  were  shuffled.  The  maximum 
’joint’  likelihoods  at  each  node  were  compiled  over  all  trials  and  used  as  empirical 
background  probability  distributions. 

We  use  empirical  testing  to  estimate  the  statistical  significance  associated  with 
the  likelihood  value  computed  for  each  node  in  the  maximum  (joint)  likelihood  assign¬ 
ment  of  habitats  to  nodes  (given  the  final  parameter  estimates).  Each  trial  preserved 
the  phylogeny,  the  inferred  habitat  assignments,  the  habitat  and  transition  rate  pa¬ 
rameters,  and  the  frequency  of  the  various  environmental  categories;  however,  each 
leaf  was  randomly  assigned  an  environmental  category  in  each  trial.  The  maximum 
joint  likelihoods  at  each  node  were  compiled  over  all  trials  and  used  as  empirical 
background  probability  distributions. 

Using  the  habitat  associations  learned  from  the  original  (non-randomized)  data 
set,  we  identified  internal  nodes  where  habitat  transitions  were  inferred  to  take  place. 
These  nodes  were  then  iterated  through  using  a  post-fix  traversal: 
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Nested  clusters 


At  each  of  these  transitional  nodes,  a  second,  pre-fix  traversal  took  place.  If  the  sub¬ 
tree  rooted  by  the  current  node  possessed  a  likelihood  greater  than  that  observed  in 
99.99%  of  the  random  trials  and  90%  of  its  leaves  shared  the  current  nodes  habitat 
assignment,  we  recognized  the  subtree  as  an  ecologically  coherent  and  statistically 
significant  cluster.  This  latter  cutoff  (90%  coherence)  was  necessary  to  ensure  mean¬ 
ingful  clusters  because  the  likelihood  at  a  parent  node  can  be  unusually  high  when  two 
nearly  significant  (but  non-identical)  child  groups  are  combined.  Requiring  a  major¬ 
ity  of  child  nodes  to  have  the  same  predicted  model  state  as  the  parent  has  the  effect 
of  identifying  clusters  that  correspond  more  directly  to  single  putative  populations. 
Making  the  threshold  too  high  (e.g.,  100%)  would  eliminate  some  larger  clusters  that 
have  a  small,  internal  nested  cluster. 

To  enable  the  discovery  of  nested  clusters,  identified  clusters  were  pruned  from 
the  phylogeny.  Nodes  representing  clades  that  contained  the  cluster  had  their  joint- 
likelihoods  modified  so  that  they  no  longer  incorporated  information  from  the  pruned 
groups.  If  the  clade  rooted  by  the  current  node  did  not  satisfy  both  the  p- value  and 
90%  coherence  thresholds,  the  second  recursion  would  descend  to  the  current  nodes 
children.  The  second  recursion  only  terminated  if  the  current  node  was  either  a  leaf, 
itself  a  habitat  transition  node,  or  determined  to  root  an  ecologically  coherent  and 
statistically  significant  cluster. 
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Supplementary  Figures 
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1  measured  using  overnight  extraction  in  90%  acetone  (79) 

2  DOC  =  dissolved  organic  carbon,  TDN  =  Total  Dissolved  Nitrogen,  TDP  =  total  dissolved  phosphorous. 
All  chemical  analyses  were  performed  at  the  University  of  New  Hampshire,  Durham,  NH. 


Figure  1.3:  Temperature  and  nutrient  concentrations  on  sampling  dates. 


38 


Group 

P-value 

1 

1.16  E-07 

2 

1.04  E-16 

3 

5.06  E-15 

4 

1.70  E-01 

5 

1.11  E-02 

6 

1.26  E-04 

7 

7.54  E-10 

8 

2.46  E-05 

9 

2.35  E-07 

10 

7.19  E-07 

11 

1.26  E-13 

12 

2.62  E-03 

13 

5.13  E-10 

14 

1.16  E-06 

15 

1.01  E-02 

16 

1.13  E-08 

17 

4.03  E-07 

18 

2.81  E-03 

19 

1.08  E-08 

20 

1.80  E-06 

21 

2.31  E-13 

22 

7.02  E-03 

23 

3.32  E-06 

24 

6.06  E-14 

25 

2.76  E-03 

Figure  1.4:  Empirical  significance  testing  of  ecologically  differentiated  clusters  predicted 
by  the  model.  Fishers  exact  test  was  used  to  identify  significant  associations  between 
clusters  and  environments  as  described  in  the  SOM.  Group  numbers  correspond  to  Figure 
1.1  in  the  main  text. 
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Figure  1.5:  Sampling  site  and  outline  of  sampling  strategy  for  determination  of  bacterial 
distribution  among  seasons  and  size  fractions  in  coastal  water.  (A)  Sampling  location 
shown  on  a  map  of  North  America  (left)  with  a  white  box  depicting  the  bounds  of  the 
picture  at  right:  the  Gulf  of  Maine.  The  arrow  points  to  the  sampling  location,  Plum  Island 
Sound,  MA.  (B)  Protocol  for  obtaining  size  fractionated  of  bacterial  seawater  isolates  using 
sequential  filtration  and  plating  on  selective  media. 
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Figure  1.6:  Empirical  statistical  testing  for  ecological  association  of  phylogenetic  clades. 
The  likelihood  that  the  observed  parsimony  score  (or  lower)  for  size  fraction  data  might  have 
arisen  by  chance  was  calculated  [using  the  method  of  [55]]  for  a  series  of  trees,  inferred  using 
subsets  of  strains  from  the  hsp60  gene  alignment.  To  test  the  effect  of  statistical  uncertainty 
on  the  inferred  association,  1%,  10%,  50%,  90%,  or  100%  of  the  sequence  data  was  randomly 
selected  and  used  to  construct  the  phylogeny  (with  3  replicates  each).  The  parsimony  score 
for  each  such  tree  is  shown  with  green  dots  corresponding  to  the  left  axis;  the  p- value  of 
obtaining  that  parsimony  score  by  chance  is  shown  with  blue  dots  corresponding  to  the  right 
axis.  These  results  are  based  on  size  fraction  data  only,  but  similar  results  are  obtained 
with  season  data. 
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Figure  1.7:  Overview  of  hidden  Markov  model  components.  (A)  Habitats  represent  hid¬ 
den  (latent)  variables  in  the  model;  experiments  do  not  directly  measure  what  habitat  a 
strain  is  adapted  to.  Two  adjacent  nodes  on  the  phylogeny  may  differ  in  their  habitat 
assignment  according  to  the  rate  of  habitat  transition  (arrows  between  habitats).  In  our 
simple  model,  all  transition  rates  are  equivalent.  (B)  Associated  with  each  habitat  is  an 
emission  probability  distribution  describing  how  likely  a  strain  associated  with  a  particular 
habitat  (H1-H3)  is  sampled  from  a  given  environment  (blue  or  black).  Bars  in  this  car¬ 
toon  depict  hypothetical  probability  distributions;  strains  adapted  to  habitat  1  have  higher 
probability  of  being  observed  in  the  black  environment  than  in  the  blue  environment.  (C) 
Our  model  maps  a  habitat  onto  each  node  in  the  phylogeny  such  that  it  maximizes  the 
probability  of  the  observed  data  (at  the  leaves).  Observations  are  limited  to  the  leaves  of 
the  phylogeny,  as  we  cannot  directly  sample  ancestral  strains.  In  the  example  shown,  the 
mostly  blue  group  is  mapped  to  habitat  3,  the  mostly  black  group  to  habitat  1,  and  the 
heterogeneous  group  is  mapped  to  habitat  2. 
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Figure  1.8:  Convergence  of  iterative  parameter  optimization.  During  the  parameter  opti¬ 
mization,  both  the  average  change  in  probability  for  components  of  the  emission  probability 
matrix  (red  line)  and  the  change  in  transition  rate  (black  line)  decrease  rapidly.  The  overall 
log-likelihood  of  the  observed  data  converges  in  concert  with  the  parameter  estimates  (blue 
line) . 
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Figure  1.9:  Reproducibility  of  inferred  habitats.  Sixty  independent  trials  of  the  iterative 
habitat  learning  process  were  performed.  Shown  are  the  frequencies  of  occurrence  of  the 
habitats  presented  in  Figure  1.1.  The  habitat  similarity  cut-off  for  counting  a  match  was 
an  average  emission  probability  difference  <  0.10  over  all  environmental  categories.  As 
expected,  the  least  abundant  habitat  in  our  data,  Hq  (see  Fig.  1.8)  is  the  least  reproducible, 
although  even  this  habitat  is  identified  in  83%  of  trials. 
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Figure  1.10:  Number  of  habitats  inferred  from  simulated  data.  We  generated  120  sim¬ 
ulated  datasets  and  compared  the  number  of  inferred  habitats  to  the  true  number.  Each 
dataset  was  randomly  assigned  between  2  and  6  habitats;  these  habitats  were  distributed 
over  environmental  categories  by  randomly  partitioning  the  interval  (0,1)  according  to  a 
uniform  distribution,  but  requiring  that  each  successive  partitioning  must  occur  at  a  larger 
value  than  the  last  (random  partitioning  without  this  restriction  leads  to  a  large  number  of 
similar  generalist  habitats).  These  habitats  were  mapped  onto  the  set  of  clusters  learned 
from  the  Vibrionaceae  data  (Figure  1A).  Shown  are  the  number  of  habitats  inferred  for  these 
trials  versus  the  number  actually  present  (a  small  amount  of  Gaussian  noise  was  added  to 
each  data  point  so  that  the  data  points  could  be  discerned).  These  results  suggest  that  the 
habitat  inference  algorithm  is  generally  conservative,  inferring  fewer  rather  than  more  than 
the  true  number  of  habitats. 
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Figure  1.11:  Inferred  habitat  associations  for  all  ancestors  of  sequenced  Vibrio  strains.  The 
rings  surrounding  the  tree  represent  the  season  (outer)  and  size  fraction  (inner)  from  which 
strains  were  isolated.  The  maximum  likelihood  assignment  of  nodes  to  habitats  is  shown 
for  all  nodes,  regardless  of  the  confidence  of  each  prediction  (only  confident  assignments 
are  shown  in  Figure  1.1  A).  Colored  circles  on  each  branch  indicate  the  habitat  assignment 
(H^-Hi?,  as  in  Figure  1.1A)  for  the  node  immediately  below  that  branch  (see  above  legend 
for  color  scheme).  Branch  lengths  are  adjusted  to  aid  visualization  and  do  not  represent 
evolutionary  distances. 
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Figure  1.12:  Comparison  of  habitat  inference  on  different  gene  phytogenies  ( hsp60  and 
mdh )  for  Vibrio  splendidus  strains.  (A)  Juxtaposition  of  habitats  learned  from  the  mdh  and 
hsp60  datasets  for  V.  splendidus  strains  only;  habitats  are  labeled  to  allow  comparison  with 
habitats  predicted  for  all  Vibrionaceae  in  Fig.  1.1.  Emission  probabilities  are  normalized 
by  the  total  number  of  isolates  obtained  in  each  environmental  sample  to  reduce  the  effects 
of  sampling  bias.  As  expected,  fewer  habitats  are  identified  from  the  mdh  phylogeny,  which 
has  a  lower  rate  of  nucleotide  divergence  and  thus  is  less  well-resolved.  (B)  Comparison 
of  habitat  assignments  to  nodes.  Because  it  is  difficult  to  map  the  internal  nodes  between 
topologically  distinct  trees,  the  habitat  assignment  for  the  last  common  ancestor  of  each 
pair  of  V.  splendidus  strains  was  compared.  If  both  corresponded  to  the  same  habitat  (He, 
He,  or  He  in  both  phylogenies),  they  were  considered  to  be  in  agreement,  otherwise  they 
were  considered  to  be  in  disagreement.  The  fraction  of  nodes  in  agreement  is  shown  as  a 
function  of  increasing  genetic  distance  between  the  pairs  of  strains  considered  (in  the  hsp60 
phylogeny).  The  black  and  red  lines  indicate  distances  that  include  50%  and  95%  of  strains 
within  the  same  cluster  in  main  text  Figure  1.1,  respectively. 
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Figure  1.13:  Influence  of  the  model  complexity/redundancy  parameter  on  inferred  habi¬ 
tats.  Clusters  are  merged  during  the  model  fitting  procedure  when  the  vectors  describing 
their  distribution  across  environments  are  more  than  90%  correlated.  As  this  cutoff  is  var¬ 
ied,  slightly  different  habitats  are  observed.  At  85%,  habitat  He  is  not  recovered;  at  higher 
values  additional  habitats  become  more  redundant  suggesting  that  the  90%  cutoff  allows 
conservative  recovery  of  characteristic  projected  habitats. 
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Figure  1.14:  Statistical  uncertainty  in  the  hsp60  gene  tree  by  constructing  100  trees 
using  non-parametric  bootstrap  re-sampling  from  the  hsp60  alignment.  Clades  supported 
in  greater  than  80%  of  bootstraps  are  indicated  with  a  black  dot. 
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Chapter  2 

Rapid  evolutionary  innovation 
during  an  Archean  Genetic 
Expansion 


Lawrence  A.  David,  Eric  J.  Aim 


This  chapter  is  presented  as  it  was  submitted  for  publication.  Corresponding 
Supplementary  Material  is  appended. 


Chapter  2 


Rapid  evolutionary  innovation 
during  an  Archean  Genetic 
Expansion 


A  natural  history  of  Precambrian  life  remains  elusive  because  of  the  rarity  of  microbial 
fossils  and  biomarkers  [79,  80].  The  composition  of  modern  day  genomes,  however, 
may  bear  imprints  of  ancient  biogeochemical  events  [81-83].  We  have  employed  an 
explicit  model  of  macroevolution  including  gene  birth,  transfer,  duplication  and  loss 
to  map  the  evolutionary  history  of  3,968  gene  families  across  the  three  domains  of  life. 
We  observe  that  horizontal  gene  transfer  (HGT)  is  the  primary  source  of  new  genes 
in  prokaryotes,  while  duplication  dominates  in  eukaryotes.  Inter-domain  gene  trans¬ 
fer  is  rare  compared  to  intra-domain  transfer  with  the  notable  exception  of  massive 
Bacteria-Eukarya  transfer  events  that  correspond  to  the  endosymbiosis  of  the  mito¬ 
chondria  and  chloroplasts  [84,  85].  Surprisingly,  we  find  that  a  brief  period  of  genetic 
innovation  during  the  Archean  eon  gave  rise  to  27%  of  major  modern  gene  families. 
Genes  born  during  this  period  are  especially  likely  to  be  involved  in  electron  trans¬ 
port,  while  later  genes  exhibit  a  gradually  increasing  usage  of  molecular  oxygen.  Our 
results  demonstrate  that  reconstructing  the  complex  interplay  between  organismal 
and  geochemical  evolution  over  Earth  history  is  becoming  a  tractable  goal. 
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Introduction 


Describing  the  emergence  of  life  on  our  planet  is  one  of  the  grand  challenges  of  the 
Biological  and  Earth  sciences.  Yet  the  roughly  three-billion-year  history  of  life  pre¬ 
ceding  the  emergence  of  hard-shelled  metazoans  remains  obscure  [79].  To  date,  the 
best  understood  event  in  early  Earth  history  is  the  Great  Oxidation  Event,  which  is 
believed  to  follow  the  invention  of  oxygenic  photosynthesis  by  the  ancestors  of  modern 
cyanobacteria  [86]  (though  the  precise  timeline  remains  controversial  [80]).  If  DNA 
sequences  from  extant  organisms  bear  an  imprint  of  this  event,  then  we  can  use  them 
to  make  and  test  predictions  [81-83];  e.g.,  genes  that  use  molecular  oxygen  will  be 
confined  to  a  group  of  organisms  emerging  after  the  Great  Oxidation  Event.  Transfer 
of  genes  across  species,  however,  can  obscure  patterns  of  descent  and  disrupt  our  abil¬ 
ity  to  correlate  gene  histories  with  the  geochemical  record  [87].  Widely  distributed 
genes,  for  example,  may  descend  from  a  Last  Universal  Common  Ancestor  (LUCA) 
as  widely  believed  to  be  the  case  for  the  translational  machinery  [88],  or  may  have 
been  dispersed  by  HGT  [12,  89],  as  in  the  case  of  antibiotic  resistance  cassettes. 

Methods  Overview 

We  developed  a  new  phylogenomic  method,  AnGST  (Analyzer  of  Gene  and  Species 
Trees),  to  account  for  the  confounding  effects  of  HGT  by  comparing  individual  gene 
phytogenies  to  the  phytogeny  of  organisms  (the  Tree  of  Life).  We  refer  to  this  process 
as  tree  reconciliation  and  provide  a  detailed  description  of  the  AnGST  algorithm  in 
the  Supplementary  Information.  Unlike  some  previous  methods  [24,  25,  27],  AnGST 
uses  the  topology  of  the  gene  family  tree  rather  than  just  its  presence/absence  across 
genomes  and  can  infer  duplication,  HGT,  and  loss  events.  Importantly,  AnGST  also 
accounts  for  uncertainty  in  gene  trees  by  incorporating  reconciliation  into  the  tree¬ 
building  process:  the  tree  that  minimizes  the  evolutionary  cost  function,  but  is  still 
supported  by  the  sequence  data,  is  chosen  as  the  best  gene  tree.  Simulated  trees 
inferred  with  this  method  are  more  accurate  than  trees  based  on  a  maximum  likeli¬ 
hood  model  of  sequence  data  alone  (Figure  2.5).  Thus,  tree  building  methods  such  as 
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AnGST  that  explicitly  model  macroevolutionary  events  may  have  utility  in  phyloge¬ 
netic  inference  [33] .  We  used  a  previously  described  Tree  of  Life  [90]  to  reconcile  gene 
families,  although  we  note  that  our  key  results  were  consistent  when  using  30  alterna¬ 
tive  reference  trees,  including  those  that  used  the  Archaea  or  Eukarya  as  outgroups 
(Figs.  2.11,  2.12).  Ensuring  proper  causality  in  a  large  reconciliation  (i.e.,  avoiding 
the  “grandfather  paradox”  in  which  a  gene  is  inferred  to  be  its  own  ancestor)  is  a 
computationally  intractable  problem  in  general  [31],  which  we  overcome  by  explicitly 
modeling  the  timing  of  evolutionary  events  based  on  a  chronogram  constructed  from 
our  reference  tree.  A  conservative  set  of  eight  temporal  constraints  was  selected  from 
the  geochemical  and  paleontological  literature  (Table  2.1),  and  the  PhyloBayes  soft¬ 
ware  package  was  used  to  infer  a  range  of  divergence  times  for  each  ancestral  lineage 
on  the  reference  tree  [91].  We  did  not  apply  temporal  constraints  to  lineage  ages  on 
the  gene  trees. 

Results 

Domain-specific  Macroevolutionary  Trends 

For  3,968  extant  gene  families  [106],  AnGST  predicted  a  total  of  109,452  speciation, 
38,575  HGT,  14,021  gene  duplication,  and  35,252  gene  loss  events  (Figure  2.1A).  The 
abundance  of  HGT  events  (on  average  9.7  per  gene  family)  underscores  the  evolu¬ 
tionary  importance  of  gene  transfer  in  prokaryotic  genome  structure  (Figure  2.15). 
Domain-specific  preferences  in  the  types  of  macroevolutionary  events  emerge  in  Fig¬ 
ure  2. IB.  On  a  per-gene  basis,  gene  transfer  is  2.1  times  more  likely  in  bacteria  than 
in  eukaryotes,  while  duplications  are  4.4  times  more  likely  in  eukaryotes  than  in  bac¬ 
teria.  The  rate  of  HGT  in  eukaryotes  is  likely  to  be  an  overestimate  because  we  did 
not  consider  eukaryote-only  gene  families.  The  bias  toward  duplication  in  eukary¬ 
otes  is  consistent  with  known  domain-specific  traits,  such  as  unequal  crossing-over, 
whole-genome  duplication  events,  and  reduced  selection  against  large  genome  sizes 
[16,  107,  108].  Interdomain  transfers  comprise  a  minority  (16.1%)  of  HGT  events 
but  exhibit  significant  over-representation  of  HGT  from  alpha-proteobacteria  to  an- 
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Event 

Constraint 

Evidence 

1 

Last  universal  common  ances¬ 
tor  arises 

<  3850  Ma 

Carbon  isotope  fractionation 
[92,  93] 

2 

Cyanobacteria  emerge 

>  2500  Ma 

Traces  of  an  aerobic  nitro¬ 
gen  cycle  [94] ,  changes  in 
redox- metal  enrichments  [95] , 
and  sulfur  isotope  fractiona¬ 
tion  data  [96,  97]  indicate  oxy¬ 
genic  photosynthesis;  traces 
of  2«-methylhopane  biomark¬ 
ers  [98]  indicate  cyanobacterial 
presence 

3 

Eukaryotes  diverge  from  Ar- 
chaea 

>  2670  Ma 

Preserved  sterane  biomarkers 
[98,  99] 

4 

Akinetes  diverge  from 

cyanobacteria  lacking  cell 
differentiation 

>  1500  Ma 

Akinete  microfossils  [100] 

5 

Archaeplastida  emerge 

>  1198  Ma 

Red  algae  microfossils  [101] 

6 

Animals  emerge 

>  635  Ma 

Preserved  demosponge  ster- 
anes  [102] 

7 

Tetrapods  emerge 

(a)  <  385  Ma 

(b)  >  359  Ma 

(a)  Tetrapod  precursor  dating 
[103];  (b)  Tetrapod  fossil  dat¬ 
ing  [104] 

8 

Buchnera  diverge  from  Wig- 
glesworthia 

>  160  Ma 

Fossil  history  of  Buchnera' s 
aphid  hosts  [105] 

Table  2.1:  Temporal  constraints  used  to  construct  chronogram.  Eight  temporal  con¬ 
straints  that  could  be  directly  linked  to  fossil  or  geochemical  evidence  were  used  to  estimate 
divergence  times  on  the  Tree  of  Life  (Figure  2.10). 
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cient  eukaryotes  (p—  3.3  x  10~'  Wilcoxon  rank  sum  test)  and  from  cyanobacteria  to 
plants  (p= 8.3  x  10—6  Wilcoxon  rank  sum  test).  These  results  likely  reflect  the  an¬ 
cient  endosymbioses  that  gave  rise  to  the  mitochondrial  and  chloroplast  organelles 
[84,  85].  Functional  analysis  of  HGT  from  the  alpha-proteobacteria  to  the  ancestral 
eukaryotes  reveals  significant  enrichment  for  energy  metabolism  genes  (p— 1.6  x  10“6 
Fisher’s  Exact  Test),  further  supporting  an  association  between  these  HGT  and  an 
energy-producing  endosymbiosis  (Figure  2.18).  HGT  from  cyanobacteria  to  Arabidop- 
sis  thaliana  are  also  enriched  for  energy-producing  genes  (p= 3.9  x  10-3  Fisher’s  Exact 
Test),  as  well  as  translation-related  genes  (p=4.4  x  10~5  Fisher’s  Exact  Test)  which 
likely  reflect  the  migration  of  70S  ribosomal  proteins  from  the  chloroplast  to  the  plant 
nucleus  [109]. 

An  Archean  Genetic  Expansion 

Gene  histories  reveal  dramatic  changes  in  the  rates  of  gene  birth,  duplication,  loss, 
and  HGT  over  geologic  time  scales  (Figure  2.2).  The  most  striking  feature  of  the 
overall  gene  flux  depicted  in  Figure  2  is  a  burst  of  de  novo  gene  family  birth  between 
3.33-2.85  Ga  which  we  refer  to  as  the  Archean  Genetic  Expansion  (AGE).  This  win¬ 
dow  gave  rise  to  26.8%  of  extant  gene  families  and  coincides  with  a  rapid  bacterial 
cladogenesis.  A  spike  in  the  rate  of  gene  loss  (~3.1  Ga)  follows  the  AGE  and  may 
represent  consolidation  of  newly  evolved  phenotypes,  as  ancestral  genomes  became 
specialized  for  newly  emerging  niches.  After  2.85  Ga,  the  rates  of  both  gene  loss  and 
gene  transfer  stabilize  at  roughly  modern-day  levels.  The  rates  of  de  novo  gene  birth 
and  duplication  after  the  AGE  appear  to  show  opposite  trends:  de  novo  gene  family 
birth  rates  decrease  and  duplication  rates  increase  over  time.  The  near  absence  of  de 
novo  birth  in  modern  times  likely  reflects  the  fact  that  ORFan  gene  families,  which 
are  widespread  across  all  major  prokaryotic  groups,  are  not  considered  in  this  study 
[110].  The  excess  of  gene  duplications  and  ORFans  in  modern  genomes  suggests  that 
novel  genes  from  both  sources  experience  high  turnover  and  rarely  persist  over  long 
evolutionary  time  scales. 

What  evolutionary  factors  were  responsible  for  the  period  of  innovation  marked  by 
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the  AGE?  While  we  cannot  provide  an  unequivocal  answer  to  this  question  using  gene 
birth  dates  alone,  we  can  ask  whether  the  functions  of  genes  born  during  this  time 
suggest  plausible  hypotheses.  In  general,  birth  of  metabolic  genes  is  enriched  during 
the  AGE,  especially  those  involved  in  energy  production  and  coenzyme  metabolism 
(Table  2.17),  but  further  inspection  also  reveals  an  enrichment  for  metabolic  gene 
family  birth  prior  to  the  AGE.  To  focus  on  specific  metabolic  changes  linked  to  the 
AGE  we:  (i)  grouped  genes  according  to  the  metabolites  they  used;  and  (ii)  we  directly 
compared  the  occurrence  of  these  metabolites  in  genes  born  during  the  AGE  to  their 
abundance  prior  to  the  AGE.  The  results  are  striking:  the  AGE-specihc  metabolites 
(positive  bars,  Figure  2.2  inset)  include  most  of  the  compounds  annotated  as  redox/e_ 
transfer  (blue  bars),  with  Fe-S-,  Fe-,  and  02-binding  gene  families  showing  the  most 
significant  enrichment  (False  Discovery  Rate  <  5%,  Fisher’s  exact  test).  Gene  families 
that  use  ubiquinone  and  FAD  (key  metabolites  in  respiration  pathways)  are  also 
enriched,  albeit  at  slightly  lower  significance  levels  (False  Discovery  Rate  <  10%). 
The  ubiquitous  NADH  and  NADPH  are  a  notable  exception  to  this  trend  and  appear 
to  have  played  a  role  early  in  life  history.  By  contrast,  enzymes  linked  to  nucleotides 
(green  bars)  exhibited  strong  enrichment  in  genes  of  more  ancient  origin  than  the 
AGE. 

The  observed  metabolite  usage  bias  suggests  that  the  AGE  was  associated  with  an 
expansion  in  microbial  respiratory  and  electron  transport  capabilities.  Proving  this 
association  to  be  causal  is  beyond  the  power  of  our  phylogenomic  model.  Yet  this 
hypothesis  is  appealing  because  more  efficient  energy  conservation  pathways  could 
increase  the  total  free  energy  budget  available  to  the  biosphere,  possibly  enabling 
the  support  of  more  complex  ecosystems  and  a  concomitant  expansion  of  species  and 
genetic  diversity.  We  note,  however,  that  while  the  use  of  oxygen  as  a  terminal  electron 
acceptor  would  have  significantly  increased  biological  energy  budgets,  oxygen-utilizing 
genes  are  only  enriched  toward  the  end  of  the  AGE  (Figure  2.14).  Thus,  the  earliest 
redox  genes  we  identified  as  part  of  the  AGE  were  likely  to  be  used  in  anaerobic 
respiration  or  oxygenic/anoxy genic  photosynthesis,  although  some  may  have  been 
co-opted  later  for  use  in  aerobic  respiration  pathways. 
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Phylogenomic  evidence  for  ancient  changes  in  global  redox  potential 

Our  metabolic  analysis  supports  an  increasingly  oxygenated  biosphere  following  the 
AGE,  as  the  fraction  of  proteins  utilizing  oxygen  gradually  increases  from  the  AGE 
until  the  present  day  (Figure  2.3;  p=3A  x  10~8,  two-sided  Kolmogorov-Smirnov  test). 
Further  indirect  evidence  of  rising  oxygen  levels  comes  from  compounds  that  are  sen¬ 
sitive  to  global  redox  potential.  We  observe  significant  increases  over  time  in  the  usage 
of  the  transition  metals  copper  and  molybdenum  (Figure  2.3;  False  Discovery  Rate 
<  5%,  two-sided  Kolmogorov-Smirnov  test),  which  is  in  agreement  with  geochemical 
models  of  these  metals’  solubility  in  increasingly  oxidizing  oceans  [82,  83]  and  with 
the  growth  of  molybdenum  enrichments  from  black  shales  that  suggests  molybdenum 
began  accumulating  in  the  oceans  only  after  the  Archean  eon  [111].  Our  prediction  of 
a  significant  increase  in  nickel  utilization  accords  with  geochemical  modeling  predic¬ 
tions  of  a  10X  increase  in  dissolved  nickel  concentration  between  the  Proterozoic  and 
modern  day  [82] ,  but  conflicts  with  a  recent  analysis  of  banded  iron  formations  that 
inferred  monotonically  decreasing  maximum  deposited  nickel  concentrations  from  the 
Archean  onwards  [112],  The  abundance  of  enzymes  using  oxidized  forms  of  nitrogen 
(N20  and  NO3)  also  grows  significantly  over  time,  with  1/3  of  nitrate-binding  gene 
families  appearing  at  the  beginning  of  the  AGE  and  3/4  of  nitrous  oxide-binding  gene 
families  appearing  by  the  AGEs  end.  The  timing  of  these  gene  family  births  provides 
phylogenomic  evidence  for  an  aerobic  nitrogen  cycle  by  the  Late  Archean  [94], 

One  striking  discrepancy  between  our  phylogenomic  patterns  and  geochemical  pre¬ 
dictions,  however,  is  a  modest  but  significant  increase  in  iron-using  genes  over  time 
(Figure  2.3;  False  Discovery  Rate  <  5%,  two-sided  Kolmogorov-Smirnov  test).  The 
cessation  of  iron  formation  deposition  roughly  1.8  Ga  and  ocean  chemistry  models  in¬ 
dicate  that  iron  usage  should  decrease  following  the  Archean,  as  declining  iron  solubil¬ 
ity  in  oxygenated  ocean  surface  waters  and  sulfide-mediated  iron  removal  from  anoxic 
deeper  waters  combined  to  reduce  overall  iron  bioavailability  [113].  The  counterin¬ 
tuitive  phylogenomic  prediction  may  reflect  the  confounding  effect  of  evolutionary 
inertia,  whereby  microbes  in  the  face  of  declining  iron  availability  could  have  found 
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more  success  evolving  a  handful  of  metal-acquisition  proteins  (e.g.  siderophores), 
rather  than  replacing  a  host  of  iron-binding  proteins.  Alternatively,  the  insolubility 
of  iron  in  modern  oceans  may  be  offset  by  large  existing  organic  pools  of  reduced  iron. 

A  precise  timeline  for  oxygen  availability  is  beyond  the  resolution  of  our  relaxed 
molecular  clock  approach  and  remains  a  contentious  topic  in  organic  geochemistry 
[80,  98,  99].  Nonetheless,  our  results  suggest  an  Archean  biosphere  containing  some 
of  the  basic  components  required  for  oxygenic  photosynthesis  and  respiration,  despite 
the  fact  that  appreciable  oxygen  levels  do  not  appear  in  the  geological  record  until 
much  later  (roughly  2.5  Ga)  [114,  115].  Although  our  results  are  consistent  with  recent 
biomarker-based  evidence  for  early  oxygenesis  [99],  special  caution  should  be  used  in 
comparing  the  molecular  and  geological  dates.  Divergence  times  for  deep  nodes  on 
our  reference  chronogram  are  uncertain  (Figure  2.16),  and  they  are  partially  based  on 
the  constraint  that  the  LUCA  could  be  as  old  as  the  earliest  evidence  for  life  (3.85  Ga) 
[92],  even  though  the  LUCA  is  likely  a  descendant  of  the  first  life  form.  Furthermore, 
although  the  PhyloBayes  dates  include  uncertainty  estimates  that  are  accurate  given 
the  assumptions  of  the  CIR  model  [91],  an  alternative,  semi-parametric  approach 
implemented  in  r8s  [116]  results  in  a  much  younger  date  of  2.75-2.5  Ga  for  the  AGE 
(compared  to  3.33-2.85  Ga  for  PhyloBayes)  which  is  closer  to  the  Great  Oxidation 
Event  (Figure  2.12).  Here  we  present  mainly  results  from  the  PhyloBayes  program, 
because  it  allowed  us  to  explicitly  account  for  uncertainty  in  the  timing  of  inferred 
events.  With  such  disparate  phylogenetic  estimates  of  the  timing  of  the  most  ancient 
lineages,  a  chronology  for  the  AGE  and  the  evolution  of  oxygen-producing  genes  will 
require  a  careful  integration  of  both  geochemical  and  genetic  data. 

Implications 

Using  just  eight  temporal  constraints  as  our  geochemical  and  paleontological  guides, 
we  have  shown  that  whole  genome  sequence  data  can  be  used  to  infer  details  of 
microbial  evolution  during  the  Archean  eon  and  can  recount  global  changes  in  redox- 
sensitive  compound  bioavailability  following  the  evolution  of  oxygenesis.  Still,  our 
phylogenomic  approach  represents  only  a  first  step  toward  linking  whole  genome 
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sequence  data  to  early  Earth  history.  By  connecting  events  in  gene  histories  to  events 
in  Earth  history,  hypotheses  of  enzyme  or  pathway  presence/absence  can  be  used 
to  make  testable  predictions  about  when  metabolic  signatures  should  first  appear  in 
the  geochemical  record.  Conversely,  geochemical  hypotheses  may  be  tested  against 
predictions  of  extant  metabolisms  (as  we  demonstrate  using  the  rise  of  oxygen).  This 
may  admit  useful  new  lines  of  evidence  for  geochemical  theories  that  suffer  from  gaps 
in  the  rock  record.  Successive  refinement  of  phylogenomic  models  against  geochemical 
constraints  may  eventually  yield  an  abundant  and  reliable  source  of  Precambrian 
fossils:  modern-day  genomes. 
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Figure  2.1:  Evolutionary  events  by  lineage.  (A)  The  number  of  nracroevolutionary  events 
is  mapped  to  each  lineage  on  an  ultrametric  Tree  of  Life  and  visualized  using  the  iToL 
website  [74].  Pie  chart  area  denotes  the  number  of  events,  and  color  indicates  event  type: 
gene  birth  (red),  duplication  (blue),  HGT  (green),  and  loss  (yellow).  (B)  The  average 
number  of  events  per  gene  copy  is  separated  by  domain  and  the  origins  of  HGT  events  are 
depicted:  Bacteria  (green),  Archaea  (beige),  Eukarya  (violet). 
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Figure  2.2:  Rates  of  macroevolutionary  events  over  time.  The  figure  shows  the  rates  of 
macroevolutionary  events  (colors  as  in  Figure  2.1)  as  a  function  of  time.  Shown  are  the 
average  rates  of  evolutionary  events  per  lineage  (events  /  10  Ma  /  lineage).  Events  that 
increase  gene  count  are  plotted  to  the  right,  and  gene  loss  events  are  shown  to  the  left  (yellow 
curve).  Genes  already  present  at  the  LUCA  are  not  included  in  the  analysis  of  birth  rates 
because  the  time  over  which  those  genes  formed  is  not  known.  The  AGE  was  also  detected 
when  alternative  chronograms  were  considered  (Figure  2.12).  Inset:  metabolites  or  classes  of 
metabolites  ordered  according  to  the  number  of  gene  families  that  use  them  that  were  born 
during  the  AGE  compared  to  the  number  born  before  the  expansion.  Metabolites  whose 
enrichments  are  statistically  significant  at  a  False  Discovery  Rate  <  10%  or  <  5%  (Fisher’s 
Exact  Test)  are  identified  using  one  or  two  asterisks,  respectively.  Bars  are  colored  by 
functional  annotation  or  compound  type  (functional  annotations  were  assigned  manually). 
Metabolites  were  obtained  from  the  KEGG  database  release  51.0  [117]  and  associated  with 
COGs  using  the  MicrobesOnline  September  2008  database  [118].  Metabolites  associated 
with  fewer  than  20  COGs  or  sharing  more  than  2/3  of  gene  families  with  other  included 
metabolites  are  omitted. 
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Figure  2.3:  Genome  utilization  of  redox-sensitive  compounds  over  time.  The  first  bar 
illustrates  a  gradual  increase  in  the  fraction  of  enzymes  that  bind  molecular  oxygen  predicted 
to  be  present  over  Earth  history  (jp=  1.3  x  ICG7,  two-sided  Kolmogorov-Smirnov  test).  Colors 
indicate  abundance  normalized  to  present-day  values.  The  lower  four  panels  group  transition 
metals,  nitrogen  compounds,  sulfur  compounds,  and  Ci  compounds.  The  fraction  of  each 
group’s  associated  genes  that  bind  a  given  compound,  normalized  to  present-day  fractions, 
is  shown  over  time  using  a  color  gradient.  Enclosed  boxes  show  raw  fractional  values  at 
three  time  points:  3.5  Ga  (left);  2.5  Ga  (middle);  and  the  present  day  (right).  For  example, 
18.9%  of  transition  nretal-binding  genes  are  predicted  to  have  bound  Mn  at  2.5  Ga,  a  value 
1.26  times  the  size  of  the  modern  day  percentage  of  15.0%.  Values  within  parentheses 
give  the  overall  number  of  gene  families  in  each  group.  To  determine  which  compounds 
showed  divergent  genome  utilization  over  time,  the  timing  of  copy  number  changes  for  each 
compound’s  associated  genes  was  compared  to  a  background  model  derived  from  all  other 
compounds.  Compounds  whose  utilization  significantly  differs  from  the  background  model 
are  marked  with  an  asterisk  (False  Discovery  Rate  <  5%,  two-sided  Kolmogorov-Smirnov 
test).  Nitrite  and  nitric  oxide  are  not  shown  due  to  their  COG-binding  similarity  to  nitrate 
and  nitrous  oxide,  respectively. 
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2.2  Supplementary  Material 

2.2.1  Overview 

We  developed  a  phylogenomic  method  that  we  named  AnGST  (Analyzer  of  Gene 
and  Species  Trees),  which  “reconciles”  any  observed  differences  between  a  gene  tree 
and  a  reference  tree  (species  tree)  by  inferring  a  minimal  set  of  evolutionary  events, 
including  horizontal  gene  transfer  (HGT),  gene  duplication  (DUP),  gene  loss  (LOS), 
speciation  (SPG)  and  exactly  one  gene  birth  or  genesis  event  (GEN).  Each  event  type 
is  assigned  a  unique  cost,  and  the  overall  sum  of  costs  associated  with  a  reconciliation 
is  minimized  (i.e.,  we  use  a  generalized  parsimony  criterion).  We  address  previously 
described  shortcomings  of  similar  parsimony-based  models  of  host-parasite  evolution 
[119]  by  accounting  for  phylogenetic  uncertainty  (using  a  new  approach  described 
below)  and  directly  estimating  event  costs  from  our  large  dataset.  We  divide  the 
gene-tree/species-tree  reconciliation  process  into  two  components: 

•  The  basic  reconciliation  step  assumes  a  known  gene  tree  and  species  tree  and 
identifies  the  set  of  evolutionary  events  (HGT,  DUP,  LOS,  SPC,  GEN)  needed 
to  explain  any  discordance  between  the  trees 

•  The  tree  amalgamation  step  accounts  for  gene  tree  uncertainty  by  incor¬ 
porating  tree  construction  into  the  reconciliation  process:  multiple  gene  tree 
bootstraps  are  provided  to  AnGST  and  the  algorithm  retains  and  combines 
bootstrap  subtrees  which  yield  the  most  conservative  reconciliation  consistent 
with  the  sequence  data. 

The  estimation  of  event  costs  from  the  input  data  is  based  on  reducing  large  fluc¬ 
tuations  in  ancient  genome  sizes.  This  method  is  presented  in  Section  2.2.5  together 
with  a  sensitivity  analysis  for  the  resulting  parameters. 

The  AnGST  software  package  is  implemented  in  the  Python  programming  lan¬ 
guage  and  can  be  downloaded  from:  http://almlab.mit.edu/ALM/Software/. 
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2.2.2  Basic  Reconciliation  Algorithm 

First  assumptions 

The  basic  reconciliation  step  requires  a  rooted,  strictly  bifurcating  gene  tree  G  and 
species  tree  S.  Each  tree  is  composed  of  a  set  of  nodes  linked  to  one  another  by  a 
set  of  connecting  edges.  We  assume  that  each  node  g  in  G  can  be  mapped  to  a  node 
s  in  S,  a  mapping  we  abbreviate  as  g:s.  This  mapping  describes  which  (extant  or 
ancestral)  genome  hosted  a  given  (extant  or  ancestral)  gene  copy.  Maps  are  known 
with  certainty  for  extant  genes,  but  must  be  inferred  for  ancestral  gene  copies. 

Algorithm  explanation 

Our  goal  in  gene/species  tree  reconciliation  is  to  recover  the  optimal  set  of  evolu¬ 
tionary  events  that  explain  any  topological  discordance  between  the  gene  and  species 
trees.  A  brute-force  search  through  all  possible  evolutionary  histories  is  intractable, 
as  the  number  of  possible  histories  grows  exponentially  with  increasing  tree  size  [120]. 
However,  for  a  given  gene  and  species  tree  pair,  there  are  only  \S\  possible  mappings 
for  the  root  node  of  the  gene  tree,  gr.  If  the  optimal  reconciliation  is  already  known 
for  each  possible  mapping  gr:sr,  where  sr  is  a  node  in  S,  a  new  outgroup  for  the 
gene  tree  can  be  added  (making  gr  a  child  of  the  new  root  node  gn),  and  optimal 
reconciliations  for  the  larger  gene  tree  can  be  quickly  computed  using  the  following 
method: 

1.  For  each  possible  pair  of  mappings  ( gr-sr ,  gn'-sn )  where  sr  and  sn  are  nodes  in 

(a)  Choose  the  most  parsimonious  explanation  for  how  a  gene  copy  in  sn  de¬ 
scended  into  ,sr. 

(b)  Concatenate  this  history  to  the  known  optimal  reconciliation  for  gr:sr,  to 
produce  the  optimal  reconciliation  for  the  ( gr:sr ,  r/n:sn)  pair 

2.  Identify  optimal  reconciliations  for  each  mapping  gn:sn  by  selecting  the  minimal 
overall  reconciliation  cost  associated  with  ( gr:sr ,  gn:sn )  as  sr  is  varied  over  the 
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nodes  of  S. 


Using  the  above  method,  the  reconciliation  problem  can  be  formulated  in  a  dy¬ 
namic  programming  framework,  yielding  computational  complexity  that  is  a  polynomial¬ 
time  function  of  gene  tree  size.  The  AnGST  program  implements  this  algorithm  as  a 
post-fix  traversal  of  the  gene  tree.  At  each  node,  reconciliations  from  child  subtrees 
are  combined  in  mini-reconciliations,  which  explain  how  the  gene  copy  at  g  coalesced 
from  two  child  copies  c i  andc2  (i.e.,  whether  HGT,  speciation,  or  duplication  oc¬ 
curred),  assuming  the  mappings  g:s,  cpsi,  and  c2:s2.  This  is  repeated  for  each  s,  Si, 
and  s2  G  S.  Mini-reconciliations  return  optimal  duplication-loss  or  HGT  scenarios  if  s 
is  the  last  common  ancestor  of  Si  and  s2,  or  if  s  is  identical  to  either  si  or  s2.  All  other 
combinations  of  s,  s i,  and  s2,  yield  mini-reconciliations  that  we  refer  to  as  complex 
scenarios.  We  include  these  scenarios  in  the  pseudocode  below  to  aid  understand¬ 
ing  of  basic  reconciliation  design,  but  we  do  not  provide  a  method  for  their  solution 
since  complex  scenarios  can  be  safely  ignored  without  loss  of  reconciliation  optimal¬ 
ity  (see  Running  Time  discussion  below).  If  g  is  a  leaf  node,  mini-reconciliations  are 
unnecessary  since  the  true  mapping  from  g  to  the  species  tree  is  known.  Once  all 
combinations  have  been  evaluated,  we  retain  the  optimal  reconciliation  associated 
with  each  possible  mapping  of  g  to  the  species  tree.  Pseudocode  for  the  reconciliation 
algorithm  is  provided  on  the  following  page  in  Python  style. 
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Pseudocode 


%  Main  % 

•  Reconcile (gene_tree . root) 

%  Methods  % 

•  define  Reconcile (node) : 

.  child_l,  child_2  =  ChildNodes (node)  %strictly  bifurcating  tree 

•  if  child_l  AND  child_2  are  null:  %is  a  leaf  node 

.  for  node_map  in  AllNodes (species_tree) : 

.  if  node_map  is  KnownHostGenome (node) : 

•  node . reconciliation_cost (node_nnap)  =  0  %correct  answer  is  known  for  leaves 
.  else: 

.  node .  reconciliation_cost  (node_nnap)  =  maxint 
.  return 

•  Reconcile (child_l)  %post-fix  traversal 

•  Reconcile (child_2) 

.  for  node_map  in  AllNodes (species_tree) :  %try  all  possible  hosts  for  ancestor 

.  for  child_l_map  in  AllNodes (species_tree) :  %try  all  possible  hosts  for  children 

•  for  child_2_map  in  AllNodes  (species__tree)  : 

•  events  =  MiniReconcile (node_map,  child_l_map,  child_2_map) 

•  prior_events_l  =  child_l.reconciliation_cost (child_l_map) 

•  prior_events_2  =  child_2.reconciliation_cost (child_2_map) 

•  overall_cost  =  Cost (events  +  prior_events_l  +  prior_events_2) 

•  cost_matrix  (node_nnap,  child_l_nnap,  child_2_nnap)  =  overall_cost 

•  for  node_map  in  AllNodes  (species__tree)  : 

.  node . reconciliation_cost  (node_niap)  =  Min (cost_matrix (node_map,  :,  :)) 

.  return 

•  define  MiniReconcile  (node  map,  child_l_map,  child_2_map)  : 

•  %  compute  DupLoss  scenarios 

•  if  node_map  is  ancestral  to  child_l_nnap  AND  child_2_map: 

.  if  node_map  is  last_common_ancestor  of  child_l_map  AND  child_2_nnap : 

.  %%%  See  Page46  for  DupLoss  pseudocode 

•  duploss_events  =  DupLoss  (node_nnap,  child_l_nnap,  child_2_nnap) 

•  else: 

.  duploss_events  =  ComplexScenario ( ) 

.  %CcmplexScenario()  not  irtplemented  —  see  Methods  Section  1.1.2  Running  Time 
discussion  for  explanation 

•  else: 

.  duploss_events  =  maxint  %  impossible  to  reconcile  with  only  dup-loss 

•  %  compute  HCT  scenarios 

•  if  node_map  is  child_l_nnap : 

.  hgt_events  =  {HGT  from  node_map  to  child_2_map} 

.  elif  node_map  is  child_2_map: 

.  hgt_events  =  {HGT  from  node_map  to  child_l_map} 

•  else: 

.  hgt_events  =  ComplexScenario ( ) 

•  return  MinCost (hgt_events,duploss_events) 


An  example  reconciliation: 


An  AnGST  reconciliation  of  two  simple,  but  discordant,  gene  and  species  trees  is 
provided  in  Figure  2.4.  Here,  we  assume  that  we  know  the  true  mappings  from  the 
leaves  in  G  to  S:  gp.s a,  g-i-^cr  gs-sB-  Because  AnGST  uses  a  post-fix  traversal  of  G 
and  the  mapping  of  G’s  leaves  to  S  is  trivial,  we  first  investigate  how  r/4  is  mapped 
to  nodes  in  S.  We  initialize  the  algorithm  by  assigning  infinite  reconciliation  cost  to 
leaf  mappings  which  deviate  from  the  known  leaf  mappings  (e.g.  gps#);  thus,  there 
is  only  one  valid  mapping  for  gi  and  r/2  • 

In  Scenario  a,  ,r/4  is  mapped  to  sa  ( g^'-SA )  and  we  infer  one  HGT  event  using 
the  mini-reconciliation  algorithm  (since  g4  is  mapped  to  the  same  lineage  as  one 
of  its  child  nodes).  Similarly,  if  we  consider  g^.sc,  we  infer  one  HGT  from  sc  to 
sa  (Scenario  (3).  In  the  case  of  g^.SE  (Scenario  7),  57  is  mapped  to  the  LCA  of 
sa  and  se  and  a  duplication- loss  scenario  is  invoked  by  the  mini- reconciler.  Other 
more  complex  scenarios  exist  (e.g.,  s4:sD),  but  these  can  be  ignored  without  affecting 
overall  reconciliation  optimality  (see  Running  Time  section  below).  Once  optimal 
reconciliations  have  been  found  for  each  possible  g±\s  mapping,  AnGST  recurses  to 
g5  and  repeats  the  process.  In  the  next  mini-reconciliation,  there  are  multiple  valid 
g^.s  mappings.  Thus  AnGST  must  iterate  through  prospective  mappings  for  both  g5 
and  r/4  (although  for  the  sake  of  illustrative  simplicity,  we  only  enumerate  a  fraction 
of  these  scenarios). 

In  the  first  mapping  shown  for  g5  (g^'.so,  <?4:sa,  .9.3 :sn),  there  is  1  SPC  (since 
sa  and  sb  are  direct  vertical  descendants  of  sp)  and  this  cost  is  added  to  the  1 
HGT  already  inferred  in  Scenario  a,  which  resulted  in  g^.SA-  For  the  combination 
0?5:sc ,g±'-Sc ,g3:sB) ,  a  cost  of  1  HGT  (because  g5  and  g 4  share  the  same  mapping)  is 
added  to  the  cost  for  Scenario  f3.  The  last  mapping  shown  is  (<?5:se,<?4:se,<?3:s.b).  A 
mini-reconciliation  that  posits  HGT  will  imply  forward-in-time  gene  transfers  -  an 
evolutionary  event  we  do  not  allow  (see  Temporal  constraints  on  HGT  below).  In¬ 
stead,  a  DUP  in  se  and  subsequent  losses  among  sa  and  sq  are  needed  to  correctly 
explain  the  mapping  of  s5:se,  <s4:s£,  and  g^'.SB-  The  g5  mapping  that  leads  to  the 
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optimal  reconciliation  is  a  function  of  the  chosen  evolutionary  event  costs.  With  a 
cost  structure:  Cspc= 0,  Chgt= 1,  Clos— 2,  Cdup= 3,  the  optimal  mappings  would 
be  g^.sp ,  Qa'Saj  and  the  associated  reconciliation  would  be  a  GEN  event  at  sp,  fol¬ 
lowed  by  SPG  at  sp,  and  an  HGT  from  sa  to  sp-  However,  if  Cspc= 0,  CWgt=10, 
Clos=(2i  Cpup=3,  the  optimal  mapping  would  be  g^'.sp,  g^'-SE,  and  the  associated 
reconciliation  would  be  an  initial  GEN  event  at  se,  followed  by  a  DUP  in  se,  2  SPCs 
each  at  sp  and  sp,  and  LOS  in  lineages  SA^  sp,  and  sp- 

Running  time 

0(|G|*|5|3)  is  an  upper  bound  on  run-time  complexity  of  AnGST,  where  \S\  and 
\G\  are  the  number  of  nodes  in  those  trees,  respectively.  Running  times  can  be 
significantly  reduced  without  loss  of  reconciliation  optimality,  however,  with  a  simple 
speedup.  When  performing  mini-reconciliations  on  all  combinations  of  g:s,  cpsi,  and 
C2-S2  for  s,  si,  and  S2  €  S,  any  complex  scenario  (s  is  not  si  or  S2,  and  s  is  not  the 
last  common  ancestor  of  both  Si  or  S2)  will  require  at  least  two  HGT  (one  to  .Si  and 
another  to  S2),  or  one  HGT  to  the  last  common  ancestor  of  .sq  and  S2  followed  by  a 
duplication-loss  scenario  originating  at  that  ancestor.  These  more  complex  scenarios 
will  therefore  always  be  suboptimal  with  respect  to  non-complex  scenarios  and  their 
evaluation  can  be  skipped  during  the  reconciliation  process.  The  resulting  reduction 
in  mapping  search  space  lowers  AnGST  run-time  complexity  to  0(|G|*|<S'|2).  When 
temporal  constraints  on  HGT  are  enforced  (see  below),  this  speedup  cannot  be  fully 
exploited,  as  nodes  ancestral  to  si  and  S2  are  potentially  optimal  values  for  s  in  HGT 
scenarios. 

In  practice,  on  3.0Ghz  single-cores  with  access  to  8GB  of  memory,  an  AnGST  run 
reconciling  100  bootstrap  trees  from  one  gene  family  against  a  reference  tree  of  100 
species  would  take  roughly:  0.1  minutes  for  gene  trees  with  10  leaves,  4  minutes  for 
gene  trees  with  50  leaves,  13  minutes  for  gene  trees  with  100  leaves,  27  minutes  for 
gene  trees  with  150  leaves,  37  minutes  for  gene  trees  with  200  leaves. 


Temporal  constraints  on  HGT 


If  provided  a  chronogram  as  a  reference  tree,  AnGST  will  restrict  the  set  of  possible 
inferred  gene  transfers  to  only  those  between  contemporaneous  lineages.  This  feature 
eliminates  the  possibility  of  inferring  multiple  HGT  events  which  are  chronologically 
impossible  [31].  Any  non-zero  chronological  overlap  is  sufficient  to  allow  transfers. 
But,  if  a  gene  transfer  is  inferred  from  node  .Si  to  node  sy,  subsequent  transfers  of  the 
gene  copy  in  s2  may  only  occur  with  lineages  which  exist  during  the  range  T\  n  T2, 
where  Tf  and  T2  are  the  times  spanned  by  the  parent  edges  of  .sq  and  s2,  respectively.  A 
feature  enabling  transfers  forward  in  time  (which  may  represent  “phantom  transfers” 
from  unsampled  taxa  [121])  has  been  built  into  AnGST,  but  remains  off  by  default 
and  was  not  used  in  our  analyses. 

Gene  tree  rooting 

Bootstrap  trees  are  assumed  to  be  unrooted.  All  possible  rootings  of  these  bootstrap 
trees  are  evaluated  during  the  reconciliation  process.  The  resulting  gene  tree  is  rooted 
on  the  branch  that  results  in  the  overall  lowest  reconciliation  score. 

2.2.3  Bootstrap  tree  amalgamation 

Errors  or  uncertainty  in  gene  phytogenies  can  lead  to  the  inference  of  spurious  macroevo¬ 
lutionary  events  [32]  and  is  a  particular  concern  for  deeply  branching  phytogenies 
[122],  AnGST  resolves  uncertainty  by  incorporating  reconciliation  into  the  tree¬ 
building  process:  the  tree  with  the  lowest  reconciliation  cost  is  chosen  from  a  large 
ensemble  of  trees  consistent  with  the  sequence  data.  To  generate  an  ensemble  of 
suitable  trees,  AnGST  considers  the  set  of  all  trees  that  contains  only  bipartitions 
observed  in  a  set  of  input  trees,  which  we  generate  with  non-parametric  bootstrap¬ 
ping.  Thus,  AnGST  typically  outputs  chimeric  trees  that  do  not  match  any  of  the 
input  bootstrap  trees  exactly,  although  every  bipartition  in  the  AnGST  tree  occurs 
in  at  least  one  of  the  bootstraps.  In  simulations,  we  observe  these  trees  to  be  signif¬ 
icantly  more  accurate  than  trees  based  on  sequence  likelihood  alone,  although  they 
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generally  have  lower  likelihood  (see  Chimeric  tree  fidelity  below  and  Figure  2.5).  Any 
number  of  bootstrap  trees  can  be  used,  but  we  found  limited  increase  in  accuracy  in 
simulated  data  as  a  result  of  using  more  than  10  (data  not  shown). 

We  implement  this  approach  in  the  following  manner  (see  Figure  2.6  for  an  exam¬ 
ple).  Given  n  gene  tree  bootstraps  (Gb,  G2,  ...  Gn }  and  a  reference  tree  S,  AnGST 
will  begin  the  basic  reconciliation  algorithm  starting  on  tree  G\.  Each  time  AnGST 
evaluates  an  internal  node  g  of  Gb,  it  also  evaluates  the  set  of  internal  nodes  /  =  {g\, 
g2,  ...  gk}  hi  other  bootstrap  gene  trees  that  define  the  same  bipartition  as  g.  The 
optimal  reconciliation  at  this  node  is  the  lowest  scoring  scenario/topology  observed 
in  any  of  the  bootstrap  trees.  That  is,  a  distinct  solution  is  computed  for  each  pos¬ 
sible  mapping  (gfs  for  gt  G  /,  s  G  S),  and  only  the  best  solution  is  retained  for  each 
value  of  s.  These  IS)  optimal  mappings  and  their  reconciliations  are  subsequently 
shared  across  all  the  nodes  in  I.  This  last  step  creates  “chimeric”  gene  trees,  as  the 
reconciliation  at  g  in  Gb  may  now  refer  to  a  topology  found  in  bootstrap  G). 

2.2.4  Simulation  and  Benchmarking 

Simulation 

Benchmarking 

We  used  simulations  to  benchmark  the  performance  of  AnGST.  Ten  independent 
gene  trees  births  were  simulated  on  each  of  the  199  extant  and  ancestral  lineages  of 
the  reference  tree.  A  simple  Poisson  statistics-based  model  of  HGT,  DUP,  and  LOS 
was  used  to  generate  random  gene  histories  and  associated  gene  trees;  the  average 
simulated  gene  family  underwent  0.21  HGT,  0.05  DUP,  0.76  SPC,  and  0.26  LOS  per 
extant  gene  copy.  (For  comparison,  our  analysis  of  the  COG  dataset  inferred  0.29 
HGT,  0.10  DLIP,  0.83  SPC,  and  0.27  LOS  per  extant  gene  copy.)  Synthetic  amino 
acid  sequences  were  generated  using  these  simulated  trees  and  the  SeqGen  software 
(v.  1.3.2)  [123].  Trees  were  reconstructed  from  the  synthetic  sequences  using  either  the 
BIONJ  algorithm  (implemented  in  PhyML),  PhyML  (v.2.4.5)  [72],  or  AnGST  via  100 
PhyML-generated  bootstrap  topologies  (see  Section  2.2.8  for  PhyML  parameters).  A 
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subset  of  75  gene  families  were  used  to  learn  costs  for  HGT  and  DUP  (see  Methods 
Section  2.2.5).  A  cost  combination  of  Chgt—^Cdup—^  minimized  genome  size  flux 
using  this  gene  family  subset  (compared  to  Chgt= 3,  Coup— 2  learned  for  the  COG 
dataset). 

Chimeric  tree  fidelity 

Following  reconciliation,  nodes  deep  in  the  interior  of  the  resultant  gene  tree  can 
contain  topologies  not  found  in  any  of  the  inputted  bootstraps  (although  all  possible 
bipartitions  of  these  subtrees  will  exist  in  at  least  one  of  the  bootstraps).  Thus,  the 
potential  search  space  of  topologies  is  vast.  We  tested  the  fidelity  of  the  chimeric 
gene  trees  learned  during  the  reconciliation  process  using  the  Robinson-Foulds  (RF) 
statistic  [124],  which  measures  the  number  of  bipartitions  not  shared  by  a  pair  of 
trees.  A  0  RF  score  indicates  perfect  concordance  (all  bipartitions  of  the  candidate 
and  reference  tree  are  identical)  and  increasing  RF  scores  denote  higher  phyloge¬ 
netic  discordance.  Analysis  of  the  225  gene  trees  with  a  minimal  level  of  complexity 
(more  than  10  leaves)  demonstrates  that  AnGST  trees  are  significantly  more  accu¬ 
rate  than  trees  generated  by  BIONJ  (p=9. 110-8  Wilcoxon  rank  sum  test)  or  PhyML 
alone  (p=l. 810-2  Wilcoxon  rank  sum  test).  Interestingly,  this  increase  in  topological 
accuracy  comes  with  a  likelihood  tradeoff  in  comparison  to  the  PhyML  algorithm 
(p=2. 710-39  Wilcoxon  rank  sum  test).  As  an  aside,  we  note  that  the  PhyML  like¬ 
lihoods  in  these  analyses  are  in  agreement  with  previous  simulations  which  showed 
PhyML  capable  of  constructing  trees  with  higher  likelihood  than  the  true  topologies 
[72], 

Inferred  birth  date  accuracy 

We  benchmarked  the  accuracy  of  gene  family  birth  dates  predicted  by  AnGST  using 
the  747  synthetic  gene  families  that  included  more  than  one  extant  gene  copy.  A 
comparison  of  inferred  birth  events  and  the  simulated  age  of  birth  events  is  shown  in 
Figure  2.7A.  There  is  a  strong  correlation  between  inferred  and  simulated  ages  (0.88) 
and  76%  of  births  are  predicted  to  within  250  My  of  their  simulated  age.  These 
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results  are  especially  promising  given  the  noisy  processes  (sequence  simulation  and 
phylogenetic  inference)  separating  simulation  of  a  gene  family  and  its  reconciliation. 
Moreover,  we  see  no  obvious  evidence  of  inference  bias  which  may  lead  to  the  false 
inference  of  a  birth  spike.  Direct  comparison  of  birth  counts  during  the  AGE  (2. 9-3. 3 
Ga)  to  simulated  births  during  the  same  period  (Figure  2.7B)  did  not  show  a  bias 
towards  over-counting  births.  We  did,  however,  observe  a  bias  toward  gene  birth  prior 
to  the  AGE,  suggesting  that  our  set  of  very  ancient  genes  (born  prior  to  3.3  Ga)  may 
be  inflated. 

2.2.5  Parameter  learning 

Minimizing  genome  size  flux 

We  address  the  problem  of  assigning  the  costs  to  each  event  type  in  a  manner  similar 
to  some  previous  studies  [25,  27,  125]:  we  use  predictions  of  ancestral  genome  sizes 
to  constrain  the  costs  Cdup  and  Chgt  (these  are  the  only  free  parameters  as  we 
can  assume  Clos= 1  arid  Cspc= 0  without  loss  of  generality).  However,  we  chose 
to  minimize  differences  in  genome  size  between  parent  and  child  nodes  (a  metric  we 
refer  to  as  genome  size  flux)  rather  than  constraining  overall  genome  size  over  time  for 
two  reasons:  first,  gene  acquisition  rates  may  not  have  been  constant  over  time  and 
ancient  genomes  may  have  been  smaller  (or  larger)  than  modern  day  genomes  [125]; 
second,  the  extinction  of  ancient  gene  families  would  lead  to  a  trend  of  smaller  inferred 
ancestral  genome  sizes  at  earlier  times  even  if  actual  genome  sizes  were  constant.  A 
grid  search  of  cost  space  showed  genome  size  flux  to  be  minimized  at:  Crgt= 3  and 
Cdup= 2  (Figures  2.8A,  2.9). 

Sensitivity  analysis 

We  investigated  the  extent  to  which  the  high  fraction  of  overall  gene  birth  detected 
during  the  Archean  Gene  Expansion  was  dependent  on  model  parameters  (Figure 
2.8B).  Gene  birth  patterns  were  invariant  over  a  broad  range  of  Cdup-  Gene  birth 
from  2. 8-3. 4  Ga  dissipated  only  at  low  Chgt  values.  However,  this  regime  of  C-hgt 
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resulted  in  unrealistic  genome  size  distributions:  ancestral  genomes  were  much  smaller 
than  present  day  ones,  and  most  genes  were  predicted  to  have  been  born  on  terminal 
branches  and  spread  via  HGT. 

2.2.6  Reference  tree  construction 

Building  a  Chronogram  of  Life 

We  used  a  previously  reported  Tree  of  Life  as  the  template  for  a  reference  chronogram 
[90].  This  template  was  constructed  using  a  concatenation  of  31  translation-related 
orthologs.  All  of  the  species  represented  in  our  gene  family  dataset  were  present 
in  this  template  tree.  Divergence  times  were  estimated  using  PhyloBayes  (v.2.3c) 
[126].  Since  autocorrelated  molecular  clock  models  have  been  shown  to  outperform 
uncorrelated  ones  in  some  cases  similar  to  this  study  [91],  we  ran  PhyloBayes  with 
a  CIR  process  model  of  rate  correlation.  Eight  sets  of  temporal  constraints  that 
could  be  directly  linked  to  fossil  or  geochemical  evidence  were  used  and  are  displayed 
in  Figure  2.10.  Benchmarking  PhyloBayes  runs  in  parallel  (n=95)  established  that 
predicted  divergence  times  and  model  likelihood  converged  after  a  burn-in  of  roughly 
1500  model  cycles.  Final  divergence  time  estimates  were  estimated  following  a  burn- 
in  of  2500  cycles,  after  which  trees  were  sampled  every  20  cycles  until  the  3500th 
cycle. 

2.2.7  Alternate  reference  trees 

We  tested  the  extent  to  which  the  AGE  was  sensitive  to  the  topology  of  the  reference 
tree  and  to  the  molecular  clock  model  used  in  chronogram  construction.  We  built  10 
separate  reference  phytogenies  using  non-parametric  bootstrapping  of  the  Ciccarclli  et 
al.  gene  alignment  [90]  (see  Section  2.2.8  for  PhyML  parameters)  and  rooted  each  with 
either  the  Bacteria,  the  Archaea,  or  the  Eukarya  as  the  outgroup.  Unequivocal  errors 
in  phytogeny  that  may  be  due  to  sequence  alignment  construction  errors  were  observed 
for  the  Bdellovibrio ,  Shigella ,  Treponema,  and  Helicobacter  pylori  taxa;  these  were 
resolved  by  manual  pruning  and  re-grafting.  Each  phylogeny  was  then  converted  to  an 
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ultrametric  tree  using  r8s  (v.1.71)  [116]  under  a  penalized  likelihood  model  (with  an 
additive  penalty  function,  truncated  Newton  nonlinear  optimization,  cross-validation 
enabled,  a  cross  validation  start  value  of  10,  a  cross  validation  smoothing  increment 
of  3,  and  the  number  of  smoothing  values  tried  set  to  4).  The  same  set  of  temporal 
constraints  was  used  as  for  PhyloBayes.  For  the  purposes  of  computational  economy, 
a  subset  of  250  COGs  was  randomly  selected  from  our  dataset  and  reconciled  against 
each  of  the  10  alternate  chronograms. 

Predicted  birth  ages  were  robust  to  the  usage  of  alternative  chronograms.  The 
median  gene  family  birth  date  difference  between  any  two  alternative  chronograms 
is  0.09  Ga  (Fig.  2.11)  Elevated  rates  of  gene  birth  during  the  Late  Archean  were 
observed  in  all  30  of  the  alternative  chronograms  and  on  average,  19%  of  the  250 
chosen  COGs  were  predicted  to  be  born  during  a  200-My  window  (Fig.  2.12).  How¬ 
ever,  the  timing  of  AGE-like  window  diverged  from  that  reported  by  PhyloBayes 
and  spanned  2. 7-2. 5  Ga  (compared  to  3. 3-2. 9  Ga  for  PhyloBayes).  Inspection  of  the 
r8s  chronograms  suggests  this  temporal  discrepancy  may  be  related  to  differences  in 
dating  the  cyanobacteria  under  the  two  models.  For  both  the  r8s  and  PhyloBayes 
chronograms,  AGE-like  events  coincided  with  the  relatively  brief  period  during  which 
the  major  bacterial  phyla,  such  as  the  Firmicutes  and  Proteobacteria,  diverged  from 
the  last  bacterial  common  ancestor.  This  period  of  compressed  cladogenesis  predates 
the  appearance  of  the  cyanobacteria  by  roughly  100-200  My  in  both  models.  Our 
r8s  analysis  places  the  initial  occurrence  of  the  cyanobacteria  at  2.5  Ga,  which  is 
precisely  the  minimum  age  constraint  for  the  appearance  of  this  clade  (see  Section 
2.2.6).  Using  the  same  constraints,  PhyloBayes  predicts  the  cyanobacteria  to  have 
emerged  3.0  Ga.  We  confirmed  the  importance  of  the  cyanobacteria  in  dating  the 
AGE  by  reducing  the  minimum  constrained  age  of  this  clade  during  r8s  chronogram 
construction;  the  resultant  model  yielded  a  younger  AGE  (data  not  shown). 

2.2.8  Gene  tree  construction 

Families  of  orthologous  genes  used  in  this  study  are  based  upon  functionally  anno¬ 
tated  orthologous  groups  from  the  COG  database  [127],  as  extended  to  a  wider  set 
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of  genomes  in  the  eggNOG  database  [106].  Due  to  computational  limitations,  we  re¬ 
stricted  this  study  to  a  subset  of  100  of  these  genomes  (11  eukaryotic,  12  archaeal,  and 
67  bacterial)  broadly  distributed  across  the  Tree  of  Life.  Sequences  were  downloaded 
from  the  eggNOG  database  in  September  of  2008. 

eggNOG-derived  families  were  filtered  to  ensure  usable  levels  of  sequence  conser¬ 
vation  with  the  aim  of  excluding  the  most  error-prone  phytogenies.  We  performed 
this  filtering  in  an  iterative  fashion:  First,  we  excised  poorly  aligned  regions  of  se¬ 
quence  [90,  128],  using  Gblocks  (0.91b)  [129]  with  the  minimum  number  of  sequences 
for  a  flank  position  set  to  half  the  number  of  sequences  in  the  alignment,  the  maxi¬ 
mum  number  of  contiguous  non-conserved  positions  set  to  8,  the  minimum  length  of 
a  block  set  to  2,  and  the  allowed  gap  positions  set  to  all.  Second,  we  excluded  genes 
with  more  than  20%  of  their  sequence  in  these  excised  regions  from  each  gene  family. 
Third,  Muscle  (v3.7)  [130]  was  used  with  default  settings  to  realign  the  remaining  se¬ 
quences.  This  process  then  returned  to  the  first  step,  unless  no  sequences  or  regions 
were  removed  in  the  first  or  second  steps  in  which  case  the  process  terminated.  Of 
the  original  4872  COGs,  788  lost  more  than  25%  of  their  original  gene  copies  during 
this  process;  these  COGs  were  considered  likely  to  be  error-prone  and  thus  excluded 
from  further  analysis.  Another  101  COGs  were  not  analyzed  due  to  their  high  gene 
copy  numbers  and  the  extreme  computational  demands  of  running  AnGST  on  those 
large  families.  A  distribution  of  gene  copy  numbers  within  each  gene  family  is  shown 
in  Figure  2.13. 

Phylogenetic  trees  were  constructed  for  the  remaining  gene  families  using  ver¬ 
sion  2.4.5  of  PhyML  [72]  and  the  following  parameters:  100  bootstrap  trees,  a  JTT 
substitution  model,  0.0  percentage  of  the  sites  were  invariable,  4  substitution  rate 
categories,  a  gamma  distribution  parameter  of  1.0,  a  BlONJ-based  starting  tree,  and 
both  tree  topology  and  branch  length  optimization  were  enabled. 
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Figure  2.4:  Example  of  a  basic  reconciliation.  An  AnGST  reconciliation  of  two 
simple,  but  discordant,  gene  ( G )  and  species  (S)  trees  is  shown.  The  mapping  of  leaves  of 
G  to  S:  gi'.SA,  gi'-sc,  g3-SB  is  indicated  with  color  (e.g.,  g\  and  are  both  shown  in  blue). 
Reconciliation  proceeds  in  a  post-fix  manner  through  the  gene  tree,  first  evaluating  possible 
mappings  from  g 4  to  nodes  in  the  S.  Once  the  reconciliation  process  is  completed  at  54, 
the  algorithm  continues  at  <75.  A  detailed  explanation  of  this  reconciliation  is  provided  in 
Section  2.2.2. 
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Figure  2.5:  AnGST  trees  are  more  accurate  than  likelihood  trees  in  simulation  studies. 
We  simulated  the  evolution  of  sequence  data  using  225  randomly  generated  gene  trees 
with  more  than  10  leaves.  Gene  trees  were  reconstructed  from  synthetic  sequence  data 
using  either  BIONJ  (red),  PhyML  (black),  or  AnGST  (blue).  Phylogenetic  accuracy  was 
evaluated  by  Robinson-Foulds  (RF)  score.  A  0  RF  score  indicates  perfect  concordance 
(all  bipartitions  of  the  candidate  and  reference  tree  are  identical)  and  increasing  RF  scores 
denote  higher  phylogenetic  discordance. The  logarithm  of  sequence  likelihood  given  each  tree 
model,  relative  to  the  likelihood  calculated  with  the  true  gene  topology,  is  plotted  on  the  X 
axis.  Mean  RF  scores  and  relative  log  likelihoods  are  drawn  with  rectangles  whose  height 
and  width  reflect  standard  errors  of  the  mean;  protruding  lines  are  standard  deviations. 
PhyML-based  trees  enjoy  significantly  higher  likelihood  scores  than  the  AnGST  chimeric 
trees  ( p  =  2.7 x  10~39  Wilcoxon  rank  sum  test),  but  the  AnGST-based  trees  are  significantly 
more  similar  to  the  correct  gene  tree  topologies  (p  =  1.8  x  ICR2  Wilcoxon  rank  sum  test). 
Outlying  points  beyond  axes  were  not  drawn  to  facilitate  viewing  mean  values,  but  were 
included  in  mean  and  significance  estimations. 
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Figure  2.6:  Amalgamation  algorithm  for  phylogenetic  uncertainty.  An  AnGST 
reconciliation  of  four  gene  tree  bootstrap  topologies  {Gi,  G2,  G3,  and  G4}  and  species  tree 
S  is  shown.  Leaf  nodes  on  each  bootstrap  map  to  leaves  on  S  according  to  color.  The 
reconciliation  begins  on  one  of  the  bootstrap  trees,  G\  (Step  1)  and  proceeds  to  an  interior 
node  (Step  2).  The  reconciliation  does  not  consider  other  topologies  for  this  subtree,  as  it 
only  contains  two  leaves.  When  the  reconciliation  reaches  the  parent  node  node  g±  (Step  3), 
AnGST  considers  subtrees  from  other  bootstraps  with  alternative  topologies  (but  identical 
leaves).  Corresponding  subtrees  are  found  on  G3  and  G4  and  rooted  at  nodes  53  and  54 
respectively  (Step  4).  Reconciliations  are  performed  in  parallel  at  <71,  $3,  and  54.  For  the 
mapping  of  these  internal  nodes  to  lineage  A  on  the  species  tree,  the  reconciliation  at  53  is 
optimal  (since  its  topology  matches  the  reference  one)  and  the  corresponding  subtree  in  G 3 
is  substituted  for  the  mappings  gi'.SA  and  g^'-SA- 
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Simulated  Birthdate  [Ga] 


Figure  2.7:  Benchmarking  AnGST  inference  accuracy.  A)  A  scatter  plot  of  simu¬ 
lated  gene  family  birth  dates  and  inferred  birth  dates.  Points  drawn  signify  midpoints  of 
branches  associated  with  birth  events.  A  slight  amount  of  Gaussian  noise  with  distribution 
N(n= 0,  cr=0.025)  has  been  added  to  each  point  so  that  overlapping  points  can  be  distin¬ 
guished.  The  correlation  coefficient  is  0.88  and  76%  of  predicted  births  are  within  250  My 
(bounded  by  red  dashed  lines)  of  their  true  ages.  B)  Birth  prediction  bias  is  plotted  as  a 
function  of  time.  Predicted  births  have  been  normalized  by  the  number  of  simulated  births 
associated  with  a  given  age.  The  AGE  (2. 9-3. 3  Ga)  is  highlighted  in  pink. 
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Genome  size  flux  sensitivity  to  HGT  and  DUP  costs 


Figure  2.8:  AnGST  parameter  learning  and  sensitivity  analysis.  A)  We  performed 
a  grid  search  over  the  costs  Chgt  and  Coup  with  the  intention  of  minimizing  average 
genome  size  flux  between  inferred  ancestral  genomes.  The  costs  Clos  and  Cspc  were  fixed 
at  1.0  and  0.0,  respectively.  Flux  can  clearly  be  minimized  along  the  Chgt  axis,  but  is 
less  sensitive  to  changes  in  Coup-  A  minimum  point  does  exist,  however,  at  Chgt= 3-0, 
Cdup=2- 0.  B)  A  sensitivity  analysis  for  our  detection  of  a  high  fraction  of  births  from  2.8- 
3.4  Ga  was  performed  over  the  same  parameter  space  evaluated  in  A).  Comparable  fractions 
of  overall  gene  birth  to  the  AGE  were  detected  in  parameter  space  near  the  genome  size 
flux  minimum.  Note  that  axes  are  reversed  in  panels  A  and  B  in  order  to  facilitate  viewing 
parameter  sensitivity  landscapes. 
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Figure  2.9:  Inferred  ancient  genome  sizes  for  Chgt=3- 0,  Cdc/p=2.0.  Circle  areas 
scale  absolutely  with  genome  sizes.  Our  optimization  metric,  genome  size  flux,  aims  to 
minimize  the  average  difference  between  parent  and  child  genomes.  Ancestral  genomes  are 
predicted  to  be  smaller  than  modern  day  ones;  this  may  reflect  the  evolution  of  increasingly 
complex  genomes,  and/or  the  extinction  of  ancestral  gene  families.  Genome  sizes  for  the 
LUCA  (183  genes)  and  the  modern-day  genome  of  E.  coli  (2507  genes)  are  labeled  in  dark 
blue.  Metazoan  genomes  appear  only  slightly  larger  than  prokaryotic  ones  because  the 
COGs  used  in  this  study  were  originally  defined  using  only  unicellular  organismsTatusov 
2000,  which  thus  biased  our  analyses  of  eukaryotic  genomes  towards  only  microbially-related 
genes. 
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Event 

Constraint 

1 

Last  universal 
common  ancestor 
emerges 

>  3850  Ma 

2 

Cyanobacteria 

emerge 

>  2500  Ma 

3 

Eukaryotes  diverge 
from  Archaea 

>  2670  Ma 

4 

Akinetes  diverge  from 
cyanobacteria  lacking 
cell  differentiation 

>  1500  Ma 

5 

Archaeplastida 

emerge 

>1198  Ma 

6 

Animals  emerge 

>  635  Ma 

7 

Tetrapods  emerge 

(a)  <  385  Ma 

(b)  >  359  Ma 

8 

Buchnera  diverge 
from  Wigglesworthia 

>  160  Ma 

Figure  2.10:  Temporal  constraints.  Eight  fossil  and  biogeochemical  constraints  were 
used  to  constrain  the  chronogram  (evidence  cited  can  be  found  in  Table  2.1).  Those  con¬ 
straints  are  overlaid  onto  the  reference  phylogeny. 
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Figure  2.11:  Sensitivity  of  predicted  birth  ages  to  variation  in  reference  tree 
topology.  Ten  bootstraps  of  the  reference  tree  were  rooted  using  either  the  Bacteria, 
Archaea,  or  Eukarya  as  an  outgroup  and  subsequently  processed  with  r8s,  producing  30 
alternative  reference  chronograms.  Birth  dates  were  inferred  for  250  gene  families  using 
each  chronogram.  We  graph  1000  random  combinations  of  alternative  chronogram  pairs 
and  gene  families  in  the  scatter  plot  above  (correlation  coefficient  =  0.86).  The  median 
gene  family  birth  date  difference  between  any  two  alternative  chronograms  is  0.09  Ga. 


83 


1 


Time  [Ga] 


Figure  2.12:  Gene  family  birth  using  30  alternative  reference  tree  topologies. 

Shown  above  are  cumulative  distribution  functions  (CDFs)  of  total  COG  birth  over  time  for 
the  30  alternative  reference  chronograms  (light  gray  lines).  Mean  CDFs  for  the  Bacteria, 
Archaea,  and  Eukarya  as  outgroups  are  shown  using  green,  red,  and  blue  dashed  lines, 
respectively.  Overall  (solid  black  line),  the  period  2. 7-2. 5  Ga  witnesses  a  gene  family  birth 
spike  of  on  average  0.23  families  born  per  1  Ma  and  accounts  for  the  birth  of  19%  of  the 
COG  families  studied.  By  contrast,  birth  rates  average  0.07  families  born  per  1  Ma  from 
2.5  Ga-present  day. 
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Figure  2.13:  Histogram  of  COG  family  sizes.  The  median  COG  family  in  our  dataset 
possesses  18  gene  copies,  and  93%  of  COG  families  have  100  or  fewer  gene  copies. 
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Figure  2.14:  (^-utilizing  gene  birth  over  time.  The  fraction  of  compound-binding 
COG  births  which  bind  02  is  shown  over  time.  A  chi-square  test  was  used  to  compare 
the  overall  number  of  COGs  born  and  the  number  of  02-binding  COGs  born  in  100  My 
windows:  prior  to  the  AGE  (3.7  Ga),  at  the  height  of  the  AGE  (3.25  Ga),  and  at  the  tail  of 
the  AGE  (2.85  Ga).  Comparisons  with  p  <  0.05  are  denoted  with  asterisks  on  the  graph. 
These  data  suggest  that  changes  in  Oo  usage  came  toward  the  end  of  the  AGE. 
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Figure  2.15:  HGT  counts  vs.  gene  family  size.  The  average  gene  family  reconciliation 
yields  9.7  inferred  HGT  events.  The  number  of  HGT  events  inferred  grows  with  the  number 
of  gene  copies  in  a  COG  family.  Gene  family  HGT  counts  also  grow  with  the  age  of  the  last 
common  ancestor  of  all  genomes  represented  in  the  family,  suggesting  that  HGT  is  more 
frequent  among  gene  families  spanning  wider  phyletic  range.  We  note  that  y-intercepts  for 
the  above  line  fittings  have  been  forced  to  equal  0. 
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Figure  2.16:  Confidence  intervals  for  divergence  times  on  reference  chronogram. 

Confidence  intervals  (95%)  were  estimated  by  PhyloBayes  and  are  shown  next  to  each 
divergence  point  on  the  tree.  Values  are  in  units  of  Ga. 
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Figure  2.17:  Function  of  gene  births  prior  to  and  during  the  AGE.  Functional 


enrichment  of  gene  birth  from  2. 8-3. 3  Ga  is  shown  for  the  20  COG  functional  categories.  A 
two-tailed  Fisher  exact  test  was  used  to  compute  the  p-value  of  a  difference  in  total  COG 
births  prior  to  the  AGE  vs.  during  the  AGE,  for  each  functional  category  (last  column). 
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Figure  2.18:  Biases  in  gene  function  associated  with  ancient  endosymbioses. 

Shown  here  is  a  functional  breakdown  of  HGT  from  cyanobacteria  into  Arabidopsis  thaliana, 
and  from  the  alpha-proteobacteria  to  ancient  eukaryotes  (which  we  defined  as  eukaryotic 
lineages  predating  the  divergence  of  Arabidopsis).  The  significance  of  functional  enrichments 
for  genes  associated  with  the  chloroplast  are  calculated  by  comparing  the  observed  number 
of  HGT  from  cyanobacteria  into  the  plant  lineage,  to  the  total  number  of  HGT  originating 
in  the  cyanobacteria,  so  as  to  account  for  functional  biases  associated  with  HGT  out  of  the 
cyanobacteria.  Similar  statistics  are  performed  on  mitochondria-related  genes.  P-values 
that  fall  below  a  5%  False  Discovery  Rate  cutoff  in  are  underlined. 
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Chapter  3 


Building  prokaryotic  species  trees 
from  thousands  of  gene  trees 

3.1  Abstract 

Sequenced-based  approaches  for  reconstructing  prokaryotic  species  trees  from  more 
than  one  gene  utilize  either  the  concatenation  of  sequences  (supermatrices)  or  the 
merger  of  separate  gene  trees  (supertrees)  [131].  Yet,  supermatrices  ignore  differ¬ 
ences  in  evolutionary  rate  and  nucleotide  compositional  bias  between  concatenated 
genes,  which  can  ultimately  reduce  the  accuracy  of  inferred  phylogenetic  trees  [132], 
Supertree  methods  can  account  for  these  inter-gene  heterogeneities,  but  a  commonly- 
used  supertree  technique,  matrix  representation,  violates  the  site-independence  as¬ 
sumption  underlying  many  phylogenetic  construction  algorithms.  Here,  I  extend  an 
alternative  supertree  method  known  as  Gene  Tree  Parsimony  (GTP),  which  chooses 
the  species  tree  as  the  topology  that  requires  the  least  number  of  duplication  events 
to  be  inferred  when  compared  to  a  set  of  gene  trees  [133].  My  GTP  model,  named 
GAnG,  can  account  for  horizontal  gene  transfer  events  (HGT)  and  is  suitable  for 
use  with  prokaryotic  gene  trees.  Preliminary  GAnG  analysis  of  250  archaeal  gene 
trees  built  from  a  subset  of  the  sequenced  archaeal  genomes  supports  hypotheses  that 
Nanoarchaeota  diverged  from  the  last  ancestor  of  the  Archaea  prior  to  the  Crenar- 
chaeota/Euryarchaeota  split,  but  also  appears  sensitive  to  HGT  artifacts  between  the 
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Crenarchaeota  and  the  Thermoplasma. 
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3.2  Introduction 


Prokaryotic  species  phylogenies  are  frequently  built  by  concatenating  sequences  from 
more  than  one  gene  [131],  in  order  to  sample  a  range  of  phylogenetically-informative 
characters  [134],  and  to  mitigate  the  role  of  topological  artifacts  caused  by  long  branch 
attraction  [122,  135]  and  nucleotide  composition  bias  [136,  137].  These  genes  can 
be  restricted  to  “core”  sets  that  are  topologically  congruent  [90,  138],  under  the 
assumption  that  horizontal  gene  transfer  (HGT)  artifacts  would  only  be  present  if 
identical  HGTs  affected  all  of  the  genes  in  the  core  set.  However,  reconstructions  of 
organismal  histories  using  core  genes  have  been  criticized  for  ignoring  the  majority 
of  phylogenetic  signal  present  in  genomes  and  have  been  referred  to  as  “The  tree[s] 
of  one  percent”  [139].  In  order  to  claim  that  species  trees  represent  overall  genome 
evolution,  computational  tools  are  required  that  can  incorporate  genetic  information 
from  hundreds,  or  even  thousands,  of  genes. 

Previous  studies  that  have  built  prokaryotic  species  trees  from  multiple  gene  fam¬ 
ilies  have  utilized  either  “supertree”  or  “supermatrix”  approaches  [90,  138,  140,  141]. 
Under  the  supermatrix  approach,  sequences  from  orthologous  gene  families  are  con¬ 
catenated  into  a  single  sequence,  which  can  subsequently  be  inputted  into  a  standard 
phylogenetic  construction  algorithm.  This  method  is  robust  to  gene  families  with  lim¬ 
ited  taxonomic  distribution,  as  simulations  have  shown  supermatrices  to  perform  well 
even  in  regimes  where  only  10%  of  species  possess  a  given  gene  copy  [142] .  However, 
the  concatenation  of  prokaryotic  genes  risks  assembly  of  sequences  with  divergent 
evolutionary  histories  due  to  HGT  [143].  Supermatrix  approaches  must  therefore  as¬ 
sume  that  the  phylogenetic  signal  from  vertically-descended  genes  overwhelms  any 
signal  from  transferred  regions  of  the  alignment.  Moreover,  species  trees  cannot  be 
constructed  in  a  computationally-efficient  way  if  differences  in  gene  evolutionary  rate 
or  nucleotide  composition  are  taken  into  account.  Simulations  have  shown  that  par¬ 
titioning  concatenations  and  fitting  these  parameters  on  a  per-gene  basis  improved 
the  fidelity  of  reconstructed  phylogenies  in  nearly  all  reported  scenarios,  and  in  some 
cases  enabled  the  reliable  recovery  of  clades  that  were  never  inferred  by  a  conven- 
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tional  supermatrix  procedure  [132],  However,  supermatrices  that  model  evolutionary 
heterogeneity  between  genes  cannot  utilize  existing  tree  building  algorithms  and  have 
so  far  required  exhaustively  searching  tree  space  in  order  to  identify  an  optimal  tree 
[132,  134], 

Supertree  methods  provide  an  alternative  phylogenetic  framework  that  can  model 
evolutionary  parameters  on  a  per-gene  basis,  while  still  constructing  a  species  phy- 
logeny  in  a  computationally-efhcient  manner.  Under  this  framework,  gene  trees  are 
built  separately  for  each  orthologous  gene  family  and  subsequently  merged  to  form 
a  species  tree.  Supertree  methods  are  distinguished  by  how  gene  tree  merger  takes 
place.  According  to  the  most  popular  merger  method  [144],  matrix  representation 
using  parsimony  (MRP),  a  binary  character  matrix  is  built  that  possesses  a  column 
for  each  internal  node  in  the  set  of  gene  trees,  and  a  row  for  each  of  the  species  sam¬ 
pled.  Species  with  a  gene  copy  descended  from  a  given  internal  node  all  share  a  “1” 
in  the  corresponding  column  of  the  matrix,  and  all  other  species  share  a  “0”  [145].  A 
consensus  tree  is  formed  by  inputting  the  character  matrix  into  a  sequence-based  phy¬ 
logenetic  construction  algorithm.  Because  the  supertree  framework  allows  trees  to  be 
built  for  individual  gene  families,  the  model  is  capable  of  accounting  for  differential 
evolutionary  rates  and  nucleotide  compositions  between  genes,  unlike  conventional 
supermatrices. 

However,  the  MRP  algorithm  in  particular  has  been  criticized  for  its  analogy 
between  the  tree  matrix  and  a  nucleic  acid  sequence  matrix  [133].  This  analogy 
violates  the  site-independence  model  assumption  of  many  phylogenetic  reconstruction 
algorithms  [76],  since  sibling  leaves  on  a  gene  tree  will  be  partitioned  similarly  for 
each  MRP  matrix  column  that  corresponds  to  one  of  these  leaves’  ancestral  nodes. 
In  practice,  this  site  dependence  should  manifest  as  a  supertree  bias  towards  subtree 
topologies  from  large  gene  trees  with  many  ancestral  nodes  and  therefore  more  sites 
in  the  tree  matrix.  This  bias  is  likely  deleterious,  as  large  gene  trees  can  be  caused 
by  frequent  HGT  and  duplication  events  that  impede  partitioning  gene  families  into 
orthologous  groups. 

Gene  tree  parsimony  (GTP)  provides  an  alternative  supertree  merger  criterion 
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that  has  received  less  statistical  criticism  than  MRP,  but  that  also  cannot  be  used 
on  prokaryotic  gene  trees  in  present  implementations.  First  introduced  by  Slowinski 
in  1997,  the  GTP  approach  chooses  the  species  tree  as  the  topology  that  requires 
the  least  number  of  duplication  events  to  be  inferred  when  species  and  gene  trees  are 
compared  (a  step  called  a  reconciliation)  [146].  This  reconciliation-based  approach 
avoids  the  matrix  construction  step  that  has  led  to  criticism  of  MRP  supertree  meth¬ 
ods.  However,  existing  implementations  of  GTP  can  only  model  gene  duplication  and 
gene  loss  events  [147,  148],  limiting  their  usage  to  eukaryotic  phylogenies. 

Here,  I  enable  the  use  of  GTP  supertrees  on  prokaryotic  trees,  through  a  new 
program  that  I  have  named  GAnG.  This  algorithm  reconciles  gene  trees  and  species 
trees  using  a  generalized  parsimony  model  I  previously  developed,  the  Analyzer  of 
Gene  &  Species  Trees,  or  AnGST  [149],  which  accounts  for  HGTs,  as  well  as  gene 
duplication  and  gene  loss  events,  in  gene  family  evolution.  Experiments  with  simu¬ 
lated  data  show  that  GAnG  can  accurately  quantify  species  tree  accuracy.  As  a  proof 
of  concept,  I  also  used  the  algorithm  to  learn  a  phylogeny  of  12  sequenced  archaeal 
species  from  250  gene  trees  with  divergent  topologies  and  that  have  likely  undergone 
HGT.  Results  of  this  analysis  support  the  basal  position  of  Nanoarchaeum  equitans 
on  the  archaeal  tree. 
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3.3  Methods 


3.3.1  Approach  overview 

The  GAnG  algorithm  is  an  iterative  process  composed  of  four  primary  steps: 

1.  Initial  topology  generation  and  reconciliation:  An  initial  reference  tree 
is  constructed  using  one  gene,  or  a  concatenation  of  multiple  genes.  Individual 
gene  trees  are  also  constructed  for  all  orthologous  gene  families.  After  trees 
have  been  built,  each  of  the  gene  trees  is  reconciled  against  the  species  tree 
using  AnGST. 

2.  Proposal  of  tree  refinements:  The  HGTs  inferred  from  reconciliations  be¬ 
tween  the  species  tree  and  gene  trees  are  mined  to  propose  subtree-prune- and- 
regraft  (SPR)  moves.  These  SPRs  are  used  to  generate  candidate  species  tree 
topologies  (Section  3.3.2).  Alternatively,  new  rootings  of  the  species  tree  can 
be  evaluated  (Section  3.3.3). 

3.  Evaluation  of  candidate  refinements:  Each  candidate  species  tree  is  eval¬ 
uated  by  comparison  to  the  set  of  gene  trees  using  AnGST,  as  described  in 
Section  3.3.4. 

4.  Selection  of  a  new  species  tree:  The  best  scoring  candidate  tree  is  identified. 
If  this  tree  has  a  better  reconciliation  score  than  the  current  species  tree,  the 
species  tree  is  replaced  with  the  candidate  tree  and  the  algorithm  returns  to 
Step  2.  Otherwise,  the  process  terminates  and  returns  the  present  species  tree. 

3.3.2  Generating  candidate  species  trees 

GAnG  searches  species  tree  space  using  an  iterative  subtree-prune-and-regraft  (SPR) 
strategy  that  generates  a  candidate  species  trees  from  an  existing  species  tree  by 
pruning  off  subtrees  and  regrafting  them  onto  new  locations  on  the  tree.  This  heuristic 
approach  is  necessary  because  it  is  unlikely  that  a  polynomial-time  algorithm  exists 
for  finding  an  optimal  species  tree  using  AnGST  [150].  An  exhaustive  search  strategy 
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is  also  not  possible,  since  the  number  of  possible  rooted  binary  trees  grows  super- 
exponentially  with  the  number  of  species  s,  following  the  function  (2s  — 3)!!  [151]  (for 
the  14  genomes  in  Fig.  3.2,  there  are  nearly  eight  trillion  possible  organismal  trees). 

SPR  moves  enable  dramatic  changes  in  tree  topology  using  relatively  few  topo¬ 
logical  operations  and  are  used  by  tree  construction  algorithms  like  PhyML  3.0  [152], 
Fast  Tree  [153]  and  RaxML  [154]  to  avoid  being  caught  in  local  minima  in  tree  space. 
An  SPR-based  strategy  for  refining  species  trees  has  the  added  advantage  of  com¬ 
plementing  AnGST  reconciliation  process.  High-frequency  HGTs  can  be  eliminated 
if  an  SPR  between  the  transferred  nodes  (in  the  reverse  direction  of  the  HGT)  is 
performed  on  the  species  tree.  Consequently,  GAnG  generates  a  candidate  species 
tree  for  each  of  the  SPR  moves  associated  with  the  100  most  frequently  inferred  HGT 
on  the  present  species  tree. 

3.3.3  Rooting  the  species  tree 

Unlike  typical  supermatrix  or  supertree  methods,  GAnG  does  not  require  an  outgroup 
sequence  to  to  infer  a  rooted  species  tree.  Since  the  species  tree  root  orients  the 
direction  of  vertical  inheritance  at  internal  nodes,  alternative  rootings  of  the  same 
unrooted  species  tree  will  have  distinct  AnGST  scores  when  reconciled  against  a 
gene  tree.  Therefore,  in  addition  to  incorporating  SPR  moves,  the  candidate  tree 
generation  routine  can  also  vary  the  position  of  the  root  node  on  the  species  tree. 

3.3.4  Scoring  a  candidate  species  tree 

A  score  S  for  a  putative  species  tree  topology  T  quantifies  how  well  the  species  tree 
fits  the  set  of  gene  trees  according  to  the  AnGST  model: 

S{T)  =  X  Rec(G|T)  (3.1) 

G 

where  Rec(G|T)  is  the  AnGST  reconciliation  score  for  a  gene  tree  G,  given  the  can¬ 
didate  species  tree  T.  The  AnGST  model  assigns  a  score  of  3,  2,  and  1  to  each  HGT, 
duplication,  and  loss  event  inferred,  respectively  [149]. 


3.4  Preliminary  results 


I  have  performed  two  preliminary  tests  using  the  GAnG  algorithm,  using  simulated 
and  real-world  data  generated  from  my  previous  study  of  microbial  genome  evolution 
[149].  These  tests  suggest: 

1.  The  AnGST-based  scoring  function  can  discern  between  species  trees  of  varying 
accuracy  (Section  3.4.1) 

2.  HGTs  can  be  used  to  propose  SPR  moves  on  the  species  tree  that  lead  to  lower 
AnGST  reconciliation  scores  (Section  3.4.2). 

3.4.1  AnGST  scores  increase  with  true  species  tree  permu¬ 
tation 

1  evaluated  how  AnGST  reconciliation  scores  changed  as  increasing  phylogenetic  noise 
was  added  to  the  Tree  of  Life  [90],  in  order  to  test  with  real-world  gene  trees  if  the 
GAnG  tree  search  process  could  be  attracted  towards  the  correct  species  tree.  This 
experiment  used  125  microbial  gene  families  randomly  selected  from  my  previous 
study  of  microbial  genome  evolution  [149].  A  total  of  100  sequenced  genomes  spanning 
all  three  domains  of  life  were  represented  in  these  gene  trees.  Because  the  true  species 
tree  for  the  100  sampled  genomes  is  not  known,  1  could  not  directly  evaluate  whether 
or  not  the  GAnG  algorithm  could  use  the  gene  trees  to  recover  the  true  species  tree. 
Instead,  I  tested  if  the  GAnG  algorithm  behaved  in  a  manner  consistent  with  less 
accurate  species  trees  receiving  higher  reconciliation  scores  than  more  accurate  species 
trees.  I  simulated  a  continuum  of  species  tree  accuracy  by  taking  a  topology  (the  Tree 
of  Life)  likely  similar  to  the  true  tree,  and  randomly  permuting  it  with  between  1  and 
10  SPR  random  moves.  This  procedure  was  used  to  generate  300  alternate  species 
trees.  Gene  tree  reconciliation  scores  against  the  species  trees  increased  linearly  with 
the  number  of  SPRs  performed  on  the  Tree  of  Life  (Figure  3.1;  K2  =  0.67),  suggesting 
that  more  accurate  species  trees  will  receive  lower  reconciliation  scores  than  less 
accurate  species  trees. 
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3.4.2  HGTs  can  be  used  to  find  a  better-scoring  tree 

The  GAnG  algorithm  was  run  to  completion  on  a  set  of  12  sequenced  archaeal  species 
and  250  gene  trees  randomly  chosen  from  my  previous  study  on  microbial  genome 
evolution  [149].  The  initial  species  tree  topology  was  pruned  from  the  Ciccarelli  & 
Bork  Tree  of  Life  (Fig.  3.2A)  [90].  The  algorithm  terminated  after  four  iterations, 
yielding  a  refined  species  tree  whose  reconciliations  with  gene  trees  were  an  average  of 
3.7%  lower  than  the  initial  tree  (Figure  3.2B).  A  single  SPR  move  of  the  Crenarchaeota 
to  a  basal  enryarchaeal  lineage  provided  the  most  dramatic  change  in  the  refined 
topology,  shifting  N.  equitans  to  a  basal  position  on  the  archaeal  tree. 
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3.5  Discussion 


GAnG  employs  the  AnGST  reconciliation  model  to  become  the  first  GTP-based  su¬ 
pertree  method  capable  of  accounting  for  HGT  events  and  therefore  appropriate  for 
inferring  prokaryotic  species  trees.  Usage  of  the  AnGST  model  carries  two  additional 
benefits.  First,  the  AnGST  algorithm  includes  a  bootstrap  amalgamation  step  that 
constructs  a  chimeric  gene  tree  from  the  bootstrap  subtrees  that  best  conform  to  the 
species  tree.  As  a  result,  gene  families  that  evolved  by  vertical  descent,  but  whose 
phylogenetic  reconstructions  are  sensitive  to  sequence  sampling  variation,  will  be  less 
likely  to  mislead  the  GTP  process.  Second,  HGT  events  inferred  by  AnGST  are 
directed  between  specific  branches  on  the  species  tree  and  can  therefore  provide  a 
heuristic  guide  for  refining  species  trees.  Frequent  HGTs  between  two  lineages  may 
be  the  result  of  a  topological  error  on  the  species  tree  and  can  be  resolved  by  making 
these  lineages  sibling  to  one  another  on  an  updated  tree. 

Preliminary  analyses  of  GAnG  on  simulated  data  demonstrate  that  the  algorithm’s 
scoring  function  will  be  attracted  towards  correct  species  topologies  (Fig.  3.1).  The 
observed  linear  relationship  between  species  tree  score  and  tree  randomization  demon¬ 
strates  also  suggests  that  GAnG  can  accurately  quantify  relative  differences  in  inac¬ 
curacy  among  trees.  Thus,  future  implementations  of  GAnG’s  tree  search  algorithm 
may  be  able  to  employ  more  sophisticated  search  algorithms,  such  as  gradient  descent. 

The  GAnG  analysis  on  real-world  data  showed  the  algorithm  capable  of  inferring 
a  prokaryotic  species  tree  largely  in  line  with  prior  archaeal  phytogenies,  despite  not 
relying  on  a  “core”  set  of  gene  families  free  of  HGT  (Fig.  3.2).  A  single  SPR  move  of 
the  Crenarchaeota  to  a  basal  euryarchaeal  lineage  provided  the  most  dramatic  refine¬ 
ment  to  the  starting  Ciccarelli  &  Bork  archaeal  tree.  This  SPR  shifted  N.  equitans 
from  its  initial  sibling  position  with  the  Crenarchaeota  (a  position  not  supported  by 
the  archaeal  phylogenetic  literature)  to  a  more  basal  position  outgrouping  both  the 
Euryarchaeota  and  the  Crenarchaeota.  This  outgroup  location  of  N.  equitans  has 
been  reported  previously  [155],  but  remains  controversial  [156].  The  refined  archaeal 
tree  also  includes  a  paraphyletic  euryarchaeal  clade  that  features  the  Thermoplas- 
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matales  and  Crenarchaeota  as  sibling  to  one  another.  A  close  relationship  between 
the  Thermoplasmatales  and  the  Crenarchaeota  has  been  reported  by  previous  phy¬ 
logenetic  studies  [157-159]  and  has  been  attributed  to  HGT  between  thermoplasma 
species  and  the  Crenarchaeota  [138]. 

Finally,  these  experiments  both  indicate  that  GAnG  could  be  run  on  datasets 
composed  of  thousands  of  gene  trees.  The  analysis  of  the  archaeal  tree  with  250  gene 
trees  took  on  the  order  of  3  days  on  a  computer  cluster.  GAnG  running  time  increases 
linearly  with  the  number  of  gene  trees,  so  analysis  of  a  1000  gene  tree  set  would 
require  approximately  two  weeks  of  compute  time  -  a  time-consuming  experiment, 
but  not  prohibitively  long.  Moreover,  a  potential  running  time  speedup  is  currently 
in  development  (see  Future  Directions  below). 
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3.6  Future  Directions 


Future  studies  will  investigate  several  potential  improvements  to  the  GAnG  model, 
specifically  in  species  tree  scoring,  running  time  improvement,  and  gene  tree  weight¬ 
ing.  The  benefits  of  each  of  these  model  changes  will  be  investigated  using  gene  and 
species  tree  simulation  software  that  1  have  previously  written  during  the  development 
of  AnGST  [149], 

A  more  accurate  and  more  efficient  version  of  GAnG  will  also  be  tested  on  a 
much  larger  archaeal  dataset  of  9053  gene  families  drawn  from  70  sequenced  archaeal 
genomes  [160].  The  resulting  archaeal  tree  will  be  used  to  test  hypotheses  involving 
the  basal  position  of  the  Korarchaeota  and  Thaumarchaeota  on  archaeal  phytogenies 
and  whether  N.  equitans  represents  a  separate  archaeal  phylum. 

3.6.1  Model  improvements 

Scoring  modifications 

The  scoring  function  Eq.  3.1  may  be  biased  by  larger  gene  trees,  which  are  likely  to 
have  a  higher  variance  in  reconciliation  scores.  An  alternative  scoring  function  robust 
to  this  bias  is: 


S(T)  =  Sign(Rec(G|T)  -  Rec(G\R))  (3.2) 

G 

This  score  is  negative  if  there  are  more  gene  trees  whose  reconciliation  scores  arc  lower 
with  the  candidate  tree  than  with  the  present  species  tree  R.  In  this  case,  a  candidate 
species  tree  would  be  accepted  and  become  the  new  present  species  tree.  Evidence 
that  S(T)  will  be  positive  for  incorrect  candidate  trees  is  presented  in  Section  3.4.1 
of  the  Preliminary  Results.  Summary  of  a  tree’s  fitness  using  the  sign  function  also 
facilitates  an  algorithmic  speedup  described  below  in  Reducing  running  time. 

Alternatively,  if  we  assume  that  reconciliation  score  variance  grows  with  reconcili¬ 
ation  score,  exceptional  changes  in  reconciliation  score  can  be  captured  by  the  scoring 
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metric: 


In  contrast  to  Equation  3.1,  however,  this  scoring  function  may  be  biased  by  smaller 
gene  trees,  which  are  likely  to  have  smaller  reconciliation  scores. 

Reducing  running  time 

I  can  avoid  reconciling  the  entire  set  of  gene  trees  if  it  can  be  quickly  determined 
that  a  candidate  species  tree  is  less  fit  than  the  current  species  tree.  I  can  perform 
this  speedup  using  the  scoring  function  in  Equation  3.2  and  by  assuming  that  gene 
trees  evolve  independently  from  one  another.  Under  this  model,  determining  the  sign 
of  S(T )  can  be  likened  to  the  problem  of  determining  if  a  coin  is  fair  using  a  finite 
number  of  coin  tosses.  To  compute  if  there  is  a  bias  towards  [Rec(GjT)  —  Rec(G|l?)] 
being  positive  or  negative,  we  can  count  the  total  number  of  G  for  which  this  value  is 
negative,  1V_,  or  non-zero,  N,  and  estimate  the  probability  that  a  candidate  species 
tree  that  changes  the  reconciliation  score  of  G  causes  a  lower  reconciliation  score: 

N_  ,  . 

p= -w  {3A) 

However,  p  is  an  estimated  probability  with  a  standard  error  sp 

*  =  <^) 

and  a  maximum  error  of 

E  =  Z  x  sp  (3.6) 

where  Z  is  taken  from  a  table  of  Z-values  and  associated  confidence  levels,  calculated 
using  a  normal  distribution  [75]. 

If  p  —  E  >  0.5,  the  candidate  topology  can  be  accepted  at  the  chosen  confidence 
level.  Similarly,  if  p  +  E  >  0.5,  the  candidate  topology  can  be  rejected.  Otherwise, 
the  maximum  error  is  too  high  to  determine  if  the  tree  should  be  accepted  or  rejected 
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and  more  gene  trees  need  to  be  reconciled.  One  way  to  determine  the  additional 
number  of  reconciliations  to  perform  in  this  case  would  be  to  calculate  E  =  1 0.5  —  p |, 
or  the  maximum  error  associated  with  p  so  that  it  can  be  determined  if  p  is  positive 
or  negative,  ft  can  be  shown  that  the  number  of  reconciliations  necessary  to  achieve 
this  error,  NE,  is  equal  to: 


Ne  = 


Z2p(l  —  p) 

e2 


(3.7) 


Gene  tree  weighting 

Core  gene  approaches  to  species  tree  construction  usually  exclude  gene  families  that 
show  evidence  of  HGT  [90] .  However,  the  strict  exclusion  of  gene  families  that  carry 
only  weak  signals  of  HGT  may  be  overly  conservative.  A  single  HGT  event  causes 
only  one  bifurcation  on  a  gene  tree  to  not  be  explained  by  vertical  inheritance  (i.e. 
a  speciation  event).  A  large  gene  tree  with  relatively  few  HGT  can  thus  still  be 
informative  for  phylogenomic  purposes.  This  intuition  can  be  incorporated  into  a 
scoring  function  by  multiplying  each  gene  tree  score  by  a  weighting  factor  w(G).  One 
method  for  calculating  this  weight  would  be  to  take  the  ratio  of  the  number  of  HGT 
possible  given  a  gene  family’s  taxonomic  distribution  (which  can  be  approximated  by 
the  square  of  the  number  of  species  L  represented  in  the  gene  family),  to  the  number 
of  HGT  inferred  on  the  gene  tree,  HGT(G): 

=  H§S) 

Translation-related  gene  families,  which  have  been  previously  been  relied  upon  for 
constructing  archaeal  phytogenies  [138],  are  weighted  more  highly  according  to  this 
scheme  than  the  average  gene  family  (p  <  ICG30,  Rank  Sum  test;  Fig.  3.3).  A  caveat 
to  this  weighting  scheme,  however,  is  that  HGT(G)  will  change  as  the  species  tree  is 
refined  by  GAnG. 

I  will  also  evaluate  via  simulation  an  alternative  gene  tree  weighting  function  that 
uses  gene  tree  branch  lengths  to  downweight  the  influence  of  gene  families  suspected  of 
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bearing  transferred  or  duplicated  genes.  This  approach  relies  on  the  assumption  that 
genetic  distances  between  gene  copies  descended  from  an  HGT  event  will  be  shorter 
than  expected,  whereas  genetic  distances  between  certain  gene  copies  descended  from 
a  duplication/loss  scenarios  will  be  longer  than  expected  (see  Figure  3.4  for  an  il¬ 
lustration  of  these  phenomena).  The  following  weighting  function  accounts  for  this 
evidence  of  non- vertical  descent: 

w(G)  =  1/ min  ( rD(sj ,  s3-|(j)  —  D(si,Sj\R))  (3.9) 

SiySj 

This  metric  iterates  over  all  pairs  of  species  st  and  s3  represented  in  a  gene  tree  G, 
and  computes  their  pairwise  distance  on  G  using  the  function  D.  The  difference 
between  this  distance  and  the  expected  distance  (taken  from  the  reference  tree  R)  is 
then  summed.  To  account  for  gene-specific  rates  of  evolution,  a  rate  term  r  is  fit  to 
G  using  a  minimization  function. 

3.6.2  A  tree  of  all  sequenced  archaea 

Once  1  have  completed  optimizing  the  GAnG  algorithm,  1  will  construct  an  archaeal 
phylogeny  using  the  arCOG  dataset  of  9504  archaeal  gene  families,  which  span  88% 
of  sequenced  archaeal  genomes  [160].  Due  in  part  to  the  challenges  of  cultivating 
archaeal  species  [161],  only  70  genomes  have  so  far  been  sequenced  from  this  domain 
of  life;  4  of  these  genomes  are  the  sole  representatives  of  3  of  the  5  known  archaeal 
phyla  [155,  162-164]  (Fig.  3.5).  This  uneven  taxon  sampling,  along  with  suspected 
HGT  events  and  unequal  rates  of  lineage  evolution,  makes  resolution  of  deep  nodes 
on  the  archaeal  phylogeny  sensitive  to  the  choice  of  analyzed  genes.  For  example,  a 
phylogeny  built  from  a  core  set  of  27  large  and  23  small  ribosomal  subunit  proteins 
supports  a  basal  position  of  N.  equitans  on  the  archaeal  tree,  but  the  phylogeny  of 
just  the  small  subunit  proteins  suggests  this  species  may  only  be  a  rapidly-evolving 
euryarchaeote  [156]. 

An  archaeal  species  tree  built  using  the  full  arCOG  dataset  should  correctly  posi¬ 
tion  N.  equitans ,  unless  phylogenetic  artifacts  systematically  bias  a  large  proportion 
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of  genes  in  archaeal  genomes.  This  tree  may  help  resolve  ongoing  debates  regarding 
the  origins  of  other  archaeal  phyla.  Previous  studies  with  varying  sets  of  core  genes 
have  placed  the  Korarchaeota  either  basal  on  the  archaeal  tree  [165,  166],  deep  within 
the  Crenarchaeota  [162],  or  sibling  to  the  Euryarchaeota  [167].  Core  gene  studies  have 
also  reported  conflicting  positions  for  the  Thaumarchaeota,  positioning  the  phylum 
either  within  the  Crenarchaeota  [162]  or  basal  on  the  archaeal  tree  [167,  168].  The 
GAnG  algorithm  offers  a  quantitative  method  for  testing  hypotheses  of  korarchaeal 
and  thaumarchaeal  origins  against  thousands  of  gene  trees.  Lessons  learned  from  re¬ 
solving  these  questions  in  archaeal  history  may  also  prove  insightful  for  future  efforts 
that  use  GAnG  to  reconstruct  a  larger  three  domain  Tree  of  Life. 
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Figure  3.1:  Average  AnGST  gene  tree  reconciliation  scores  as  species  tree  ran¬ 
domization  increases:  Between  1-10  random  SPR  moves  were  made  to  300  copies  of  the 
Tree  of  Life  [90] ,  to  produce  a  continuum  of  species  tree  accuracy.  Each  of  the  these  species 
trees  was  reconciled  against  a  set  of  125  gene  trees  and  the  average  AnGST  score  for  each 
reconciliation  is  plotted  on  the  y-axis.  The  R 2  of  the  best  fit  line  (shown  in  red)  to  these 
data  is  0.67. 
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Figure  3.2:  An  archaeal  phytogeny  refined  using  250  gene  trees  and  HGT- 
proposed  SPRs:  (A)  Initial  phylogeny  of  archaeal  species  pruned  from  the  Ciccarelli  & 
Bork  Tree  of  Life  [90].  Crenarchaeal  lineages  are  labeled  in  blue,  euryarchaeal  lineages  are 
labeled  in  red,  and  the  nanoarchaeal  lineage  is  labeled  in  green.  The  most  dramatic  accepted 
SPR  move  on  this  topology  is  shown  using  a  black  arrow.  (B)  The  resulting  archaeal  tree 
after  4  iterations  of  tree  refinement.  The  average  AnGST  gene  tree  reconciliation  score 
using  this  refined  tree  decreased  3.7%  relative  to  the  initial  phylogeny. 
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Figure  3.3:  Inferred  HGT  as  a  function  of  the  number  of  species  represented 
in  a  gene  tree:  Each  of  the  9053  arCOG  gene  trees  was  reconciled  against  an  archaeal 
species  tree  (Fig.  3.5).  The  number  of  inferred  HGT  for  each  gene  family  is  plotted  against 
the  square  of  the  number  of  species  represented  in  the  gene  family  (a  rough  approximation 
of  the  number  of  possible  HGT)  on  a  log-scale.  The  201  gene  families  annotated  by  the 
arCOG  database  as  involved  in  translation  are  shown  in  red,  and  all  other  gene  families 
are  shown  in  blue.  Gene  tree  weights,  as  calculated  by  Eq.  3.8  are  significantly  higher  for 
translation-associated  gene  families  (p  <  10~30,  Rank  Sum  test). 
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Figure  3.4:  Expected  gene  tree  branch  lengths  following  duplication  or  HGT: 

Two  hypothetical  evolutionary  histories  are  shown  for  a  gene  family.  On  the  left,  an  ances¬ 
tral  duplication  event  (D),  followed  by  four  loss  events  (L),  creates  higher  than  expected 
genetic  distance  between  species  A  and  B,  and  between  species  C  and  D.  On  the  right, 
HGT  events  (T)  cause  lower  than  expected  genetic  distance  between  species  A  and  C,  and 
between  species  B  and  D. 
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Figure  3.5:  Maximum  likelihood  tree  of  archaeal  species:  This  tree  of  70  archaeal 
species  will  be  the  starting  topology  for  the  GAnG  analysis  of  9504  arCOG  gene  trees.  The 
tree  was  built  using  a  maximum  likelihood  analysis  of  53  concatenated  ribosomal  proteins 
[138,  160]. 
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This  thesis  contributes  new  algorithms  for  comparative  microbial  genomics,  partic¬ 
ularly  in  the  subfields  of  microbial  ecology  and  microbial  genome  evolution.  In  Chap¬ 
ter  1  of  Part  II,  I  presented  the  algorithm  AdaptML,  which  can  identify  genetically- 
and  ecologically-distinct  bacterial  populations.  I  used  this  tool  to  identify  clusters  of 
marine  vibrio  that  appear  to  have  differentiated  in  response  to  their  nutrient  prefer¬ 
ences.  Code  for  AdaptML  has  been  made  public  and  with  the  help  of  Albert  Wang,  an 
undergraduate  researcher  in  Eric  Aim’s  laboratory,  I  have  created  a  Webserver  where 
users  can  submit  their  own  phylogenetic  and  ecological  data  for  AdaptML  to  pro¬ 
cess  online  (http:/ / almlab.mit.edu/adaptml/).  Outside  investigators  have  used  these 
tools  to  run  AnGST  on  their  own  datasets.  Fred  Cohan’s  group  ran  the  algorithm  to 
End  ecotypes  in  the  genus  Bacillus  that  would  have  been  overlooked  with  traditional 
sequence  divergence-based  thresholds  for  calling  species  [11].  Oakley  and  colleagues 
used  AdaptML  to  help  confirm  evidence  for  niche  partitioning  among  sampled  Desul- 
fobulbus  [169].  Other  investigators  have  promoted  using  AdaptML  for  future  studies 
of  microbial  evolution  and  ecology.  Daniel  Falush  has  suggested  using  AdaptML  to 
infer  the  evolutionary  history  of  microbe-host  adaptation  [170],  and  Ford  Doolittle  & 
Olga  Zhaxybayeva  have  pointed  out  the  algorithm’s  advantages  over  strict  threshold- 
based  approaches  to  microbial  ecology  [171]. 

In  Chapter  2,  I  introduced  the  algorithm  AnGST,  which  reconciles  an  ultrametric 
reference  tree  and  a  gene  tree  to  infer  HGT,  gene  duplication,  and  gene  family  birth 
events  in  a  chronological  context.  Implicit  in  these  reconciliations  are  known  con¬ 
straints  on  organismal  history  drawn  from  paleontological  and  geochemical  records. 
Analysis  of  100  sequenced  genomes  with  AnGST  produced  evidence  for  a  massive  ex¬ 
pansion  of  microbial  genetic  diversity  during  the  Archean  eon,  as  well  as  the  gradual 
oxygenation  of  the  biosphere  over  the  past  3  Ga.  This  later  finding  is  in  agreement 
with  other  studies  that  suggest  secular  changes  in  earth  geochemistry  were  recorded 
in,  and  can  now  be  mined  from,  microbial  genomes  [81,  83,  172,  173].  Further  evidence 
of  the  link  between  the  AnGST  analysis  and  biogeochemistry  could  be  uncovered  by 
follow  up  studies  on  enzyme  families  whose  first  appearance  can  be  dated  using  the 
sedimentary  record.  For  example,  Form  1  RuBisCO  makes  specific  contacts  with 
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molecular  oxygen  and  first  appeared  2. 7-2. 9  Ga  according  to  carbon  isotope  fraction¬ 
ation  results  [174],  AnGST’s  predicted  birth  date  for  this  enzyme’s  small  subunit, 
between  2. 0-3.0  Ga,  is  wide,  but  not  incompatible  with  the  fractionation  results. 
Identification  and  analysis  of  other  biomarkers  whose  age  can  be  independently  ver¬ 
ified  by  AnGST  will  lead  to  potentially  fruitful  collaborations  between  genomicists, 
paleontologists,  and  biogeochemists. 

Lastly,  in  Chapter  3,  I  introduced  the  supertree  algorithm  GAnG,  which  is  capable 
of  constructing  prokaryotic  species  trees  from  thousands  of  gene  trees.  Preliminary 
GAnG  analysis  of  250  archaeal  gene  trees  built  from  a  subset  of  the  sequenced  ar- 
chaeal  genomes  supports  the  hypothesis  that  Nanoarchaeota  diverged  from  the  last 
ancestor  of  the  Archaea  prior  to  the  Crenarchaeota/Euryarchaeota  split,  but  also 
appears  sensitive  to  HGT  from  the  Crenarchaeota  to  the  Thermoplasma.  Further 
improvements  to  the  GAnG  scoring  function  and  gene  tree  weighting  scheme  are  on¬ 
going.  An  improved  version  of  GAnG  will  be  used  to  build  a  species  tree  relating 
70  sequenced  Archaea  from  all  five  known  phyla.  One  important  observation  during 
that  tree’s  construction  will  be  the  topology  of  the  species  tree  scoring  landscape  near 
variations  on  deep  node  branching  order.  A  relatively  flat  landscape  in  that  regime 
may  suggest  that  the  phyla  radiated  so  rapidly  during  early  archaeal  evolution  that 
their  branching  order  cannot  be  resolved  [141]  or  that  HGT  dominated  vertical  de¬ 
scent  during  the  formation  of  the  archaeal  phyla  [175].  By  contrast,  a  bowl-shaped 
landscape  with  a  clear  scoring  minimum  will  provide  support  that  a  Tree  of  Life  exists 
for  the  Archaea,  and  will  encourage  attempts  to  construct  a  universal  Tree  of  Life 
with  sequenced  genomes  from  all  three  domains  of  life. 
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